How is std::string::size() implemented?


cout << sizeof(std::string) << endl;
The result is 8 on my 64-bit machine, which is the same as sizeof(char*), so I am assuming the string class stores only the char*. How, then, is the size function implemented? Is it using strlen (since it is not storing the actual size or the pointer to the ending byte)?
On this page, it shows the size function has a constant time-complexity, so I am confused. And on another page someone has a larger string size.
I am using GCC 4.7.1 on Fedora 64 bit.

There could be many explanations for that. Just because std::string happens to store a pointer and nothing else does not mean that this is necessarily char * pointer to the controlled sequence. Why did you jump to that conclusion?
It could easily turn out that your std::string is a PImpl-style wrapper for a pointer to some internal object that stores all internal household data, including the char * pointer, the length and whatever else is necessary. That way the internal object can be arbitrarily large, without having any effect on the size of std::string itself. For example, in order to facilitate fast reference-counted copying, in some implementations std::string might be implemented similarly to std::shared_ptr. I.e. std::string in that case would essentially become something like std::shared_ptr<std::string_impl> with added copy-on-write semantics.
The target "string implementation" object might even use "struct hack"-style approach to store the actual string, meaning that instead of storing char * pointer it might embed the entire string into itself at the end.

Looking at the doxygen docs for libstdc++:
_CharT* _M_p; // The actual data
Assuming std::basic_string<char>, _M_p is a char* pointer to the actual data, so that is why you are getting 8.
It even says:
Where the _M_p points to the first character in the string, and you
cast it to a pointer-to-_Rep and subtract 1 to get a pointer to the
header.
So, it hides a pointer to the actual representation (capacity, length, etc.) in a block of memory right before where the string data is stored.
Then, there is the following member function to get to the representation:
_Rep* _M_rep() const
{ return &((reinterpret_cast<_Rep*> (_M_data()))[-1]); }
and then they call it like this _M_rep()->_M_length; to get the size for example.
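To make the layout concrete, here is a minimal sketch of the same idea: a class whose only data member is a char*, with the length and capacity kept in a header placed just before the characters. The names Rep and TinyString are made up for illustration and this is not the libstdc++ code; it only shows why size() can be constant-time while sizeof stays equal to sizeof(char*).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

// Hypothetical header stored immediately before the characters.
struct Rep
{
    std::size_t length;
    std::size_t capacity;
};

class TinyString
{
public:
    explicit TinyString(const char* s)
    {
        assert(s != nullptr);
        const std::size_t n = std::strlen(s);
        // One allocation laid out as: [ Rep header | n characters | '\0' ]
        void* block = std::malloc(sizeof(Rep) + n + 1);
        if (!block)
            throw std::bad_alloc();
        Rep* rep = new (block) Rep;
        rep->length = n;
        rep->capacity = n;
        m_p = reinterpret_cast<char*>(rep + 1); // first byte after the header
        std::memcpy(m_p, s, n + 1);             // copy characters and terminator
    }
    ~TinyString() { std::free(rep()); }

    // Copying is disabled to keep the sketch short.
    TinyString(const TinyString&) = delete;
    TinyString& operator=(const TinyString&) = delete;

    // O(1): read the stored length from the header, no strlen needed.
    std::size_t size() const { return rep()->length; }
    const char* c_str() const { return m_p; }

private:
    // Step back one header from the character pointer.
    Rep* rep() const { return reinterpret_cast<Rep*>(m_p) - 1; }

    char* m_p; // the only data member, so sizeof(TinyString) == sizeof(char*)
};
```

With this layout, sizeof(TinyString) is the size of one pointer regardless of how long the stored text is.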

Your assumption that std::string is char* is wrong. Here is one of a few possible implementations with sizeof(std::string)==sizeof(char*):
struct string // imagine this is std::string
{
    struct string_implementation
    {
        size_t size;
        size_t buffer_size;
        char_traits whatever;
        char *buffer; // Here is your actual string!
    };
    string_implementation *ptr;
};

std::string is a typedef for std::basic_string<char>, and basic_string is defined (on my machine) in file /usr/include/c++/4.4/bits/basic_string.h. There's a lot of indirection in that file, but roughly speaking std::string stores a pointer to the actual data:
// Use empty-base optimization: http://www.cantrip.org/emptyopt.html
struct _Alloc_hider : _Alloc
{
_Alloc_hider(_CharT* __dat, const _Alloc& __a)
: _Alloc(__a), _M_p(__dat) { }
_CharT* _M_p; // The actual data.
};
and this is why you observed such behavior. This pointer can be cast to obtain a pointer to the structure that describes the well-known string properties (located just in front of the actual data):
struct _Rep_base
{
size_type _M_length;
size_type _M_capacity;
_Atomic_word _M_refcount;
};
_Rep* _M_rep() const
{ return &((reinterpret_cast<_Rep*> (_M_data()))[-1]); }

Related

Dynamic Memory for Char pointer Array

I'm still very new to C++, and I'm having issues with allocating heap memory.
This is what I have in my header file:
const int NUM_WORDS = 1253;
const int CHAR_SIZE = 256;
class CWords
{
public:
// constructor(s)
CWords();
// destructor
~CWords();
// public member functions
void ReadFile();
const char* GetRandomWord() const;
private:
char* m_words[NUM_WORDS]; // NUM_WORDS pointers to a char
int m_numWords; // Total words actually read from the file
};
I'm trying to allocate space in the implementation cpp file, but I can't (default constructor):
CWords::CWords()
{
m_numWords = 0;
m_words = new char[strlen(m_words) + 1]; // LINE 31
strcpy(m_words, "NULL"); // LINE 32
}
line 31 gives me:
cannot convert 'char**' to 'const char*' for argument '1' to 'size_t strlen(const char*)'
and line 32 gives me:
cannot convert 'char**' to 'char*' for argument '1' to 'char* strcpy(char*, const char*)'
I don't know what these errors mean.

The answer assumes that there is no strict requirement to force using C style arrays and strings.
As mentioned in the comments, even if you did manage to get rid of the bugs, this is not the recommended way to go in C++.
std::vector is the go-to container if you need a dynamic size array. std::array is for a static size one.
Also it is better to use std::string than C style strings. Doing so will save you the trouble of manual memory management (and the bugs that come with it).
Below is a modification of the header of your class:
#include <vector>
#include <string>
class CWords
{
public:
// constructor(s)
CWords();
// destructor
~CWords();
// public member functions
bool ReadFile();
std::string const & GetRandomWord() const;
private:
std::vector<std::string> m_words;
};
I'll leave the implementation for you.
Notes:
ReadFile returns a bool in my version (not void). Reading from a file may fail, and so it is better to return false to indicate an error.
If you ever consider inheriting from class CWords, you'd better make the destructor virtual. See here: Should every class have a virtual destructor?.

if m_words should represent multiple strings
In this case m_words is a static array with pointers pointing to several strings.
Hence it makes sense to initialize those pointers with NULL in your constructor (e.g. using a for loop for NUM_WORDS entries).
Besides that, initialize the member m_numWords with 0. This member should record how many entries of m_words are already filled.
if m_words should represent 1 string
Use char m_words[NUM_WORDS]; in your header file
then you don't need to dynamically allocate memory in the constructor (the memory is then part of every instance's size).
Or simply use a char* m_words; in your header file
and dynamically allocate as many chars as you need at a later time. This way you can free your string and allocate it with a different size whenever needed.
However in this case it is strongly recommended to initialize m_words with a NULL pointer in constructor.
Please also have a look at the strlen() description. It can only count the characters, that are already contained in your string.
Use sizeof() to measure the bytesize of static arrays.
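If the C-style array of pointers has to stay, the "multiple strings" case could be sketched roughly as follows. AddWord, GetWord and Count are hypothetical helpers added for illustration; the point is that strlen and strcpy operate on one char* element (m_words[i]), never on the array m_words itself, which is what the two compiler errors were complaining about.

```cpp
#include <cassert>
#include <cstring>

const int NUM_WORDS = 1253;

class CWords
{
public:
    CWords() : m_numWords(0)
    {
        for (int i = 0; i < NUM_WORDS; ++i)
            m_words[i] = nullptr; // no word stored in this slot yet
    }

    ~CWords()
    {
        for (int i = 0; i < m_numWords; ++i)
            delete[] m_words[i]; // matches the new[] in AddWord
    }

    // Copying is disabled so two objects never own the same buffers.
    CWords(const CWords&) = delete;
    CWords& operator=(const CWords&) = delete;

    // Hypothetical helper: stores a private copy of one word.
    bool AddWord(const char* word)
    {
        if (word == nullptr || m_numWords >= NUM_WORDS)
            return false;
        // Allocate for ONE element of the array, sized from the source string.
        m_words[m_numWords] = new char[std::strlen(word) + 1];
        std::strcpy(m_words[m_numWords], word);
        ++m_numWords;
        return true;
    }

    const char* GetWord(int i) const
    {
        assert(i >= 0 && i < m_numWords);
        return m_words[i];
    }

    int Count() const { return m_numWords; }

private:
    char* m_words[NUM_WORDS]; // NUM_WORDS pointers to char
    int m_numWords;           // number of slots actually filled
};
```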

String encryption function works with char[], but not a plain string

I'm using a version of XTEA encryption from Wikipedia that's written in C++. I wrote a function to encrypt a string:
const char* charPtrDecrypt(const char* encString, int len, bool encrypt)
{
/********************************************************
* This currently uses a hard-coded key, but I'll implement
* a dynamic key based on string length or something.
*********************************************************/
unsigned int key[4] = { 0xB5D1, 0x22BA, 0xC2BC, 0x9A4E };
int n_blocks=len/BLOCK_SIZE;
if (len%BLOCK_SIZE != 0)
++n_blocks;
for (int i = 0; i < n_blocks; i++)
{
if (encrypt)
xtea::Encrypt(32, (uint32_t*)(encString + (i*BLOCK_SIZE)), key);
else
xtea::Decrypt(32, (uint32_t*)(encString + (i*BLOCK_SIZE)), key);
}
return encString;
}
It works when I supply a const char encString[] = "Hello, World!", but when I supply a string literal directly, e.g. const char* a = charPtrDecrypt("Hello, World!", 14, true), it crashes.

There's an old saying (I know it's old, because I first posted it to Usenet around 1992 or so) that: "If you lie to the compiler, it will get its revenge." That's what's happening here.
Here:
const char* charPtrDecrypt(const char* encString, int len, bool encrypt)
...you promise that you will not modify the characters that encString points at. That's what the const says/means/does.
Here, however:
xtea::Encrypt(32, (uint32_t*)(encString + (i*BLOCK_SIZE)), key);
...you cast away that constness (cast to uint32_t *, with no const qualifier), and pass the pointer to a function that modifies the buffer it points at.
Then the compiler gets its revenge: it allows you to pass a pointer to data you can't modify, because you promise not to modify it--but then when you turn around and try to modify it anyway, your program crashes and burns because you try to modify read-only data.
This can be avoided in any number of ways. One would be to get away from the relatively low-level constructs you're using now, and pass/return std::strings instead of pointers to [const] char.
The code has still more problems than just that though. For one thing, it treats the input as a sequence of uint32_t items, and rounds its view of the length up to the next multiple of BLOCK_SIZE. Unfortunately, it doesn't actually change the size of the buffer, so even when the buffer is writable, it doesn't really work correctly--it still reads and writes beyond the end of the buffer.
Here again, std::string will be helpful: it lets us resize the string up to the correct size instead of just reading/writing past the end of the fixed-size buffer.
Along with that, there's a fact the compiler won't care about, but you (and any reader of this code) will (or at least should): the name of the function is misleading, and has parameters whose meaning isn't at all apparent--particularly the Boolean that governs whether to encrypt or decrypt. I'd advise using an enumeration instead, and renaming the function to something that can encompass either encryption or decryption.
Finally, I'd move the if statement that determines whether to encrypt or decrypt outside the loop, since we aren't going to change from one to the other as we process one input string.
Taking all those into account, we could end up with code something like this:
enum direction { ENCRYPT, DECRYPT };
std::string xtea_process(std::string enc_string, direction d) {
    unsigned int key[4] = { 0xB5D1, 0x22BA, 0xC2BC, 0x9A4E };
    size_t len = enc_string.size();
    // round up to the next multiple of BLOCK_SIZE
    len = (len + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE;
    enc_string.resize(len); // enlarge the string to that size, if necessary
    if (d == DECRYPT)
        for (size_t i = 0; i < len; i += BLOCK_SIZE)
            xtea::Decrypt(32, reinterpret_cast<uint32_t *>(&enc_string[i]), key);
    else
        for (size_t i = 0; i < len; i += BLOCK_SIZE)
            xtea::Encrypt(32, reinterpret_cast<uint32_t *>(&enc_string[i]), key);
    return enc_string;
}
This does still leave (at least) one point that I haven't bothered to deal with: some machines may have stricter alignment requirements for a uint32_t than for char, and it's theoretically possible that the buffer used in a string won't meet those stricter alignment requirements. You could run into a situation where you need to copy the data out of the string, into a buffer that's properly aligned for uint32_t access, do the encryption/decryption, then copy the result back.
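A minimal sketch of that copy-out/copy-back approach follows. It assumes BLOCK_SIZE is 8 bytes (two uint32_t values per block), and toy_transform is a hypothetical, self-inverse stand-in for xtea::Encrypt/xtea::Decrypt so the example compiles on its own; it is not real encryption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

const std::size_t BLOCK_SIZE = 8; // assumption: one block is two uint32_t values
static_assert(BLOCK_SIZE == 2 * sizeof(std::uint32_t), "block must hold two words");

// Hypothetical stand-in for the real cipher: XOR with key material.
// Applying it twice restores the original block.
void toy_transform(std::uint32_t v[2], const std::uint32_t key[4])
{
    v[0] ^= key[0] ^ key[2];
    v[1] ^= key[1] ^ key[3];
}

std::string process_blocks(std::string s, const std::uint32_t key[4])
{
    // Round up to a whole number of blocks; resize pads with '\0'.
    const std::size_t len = (s.size() + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE;
    s.resize(len);
    assert(s.size() % BLOCK_SIZE == 0);

    for (std::size_t i = 0; i < len; i += BLOCK_SIZE) {
        std::uint32_t block[2];                // correctly aligned for uint32_t
        std::memcpy(block, &s[i], BLOCK_SIZE); // copy out of the string
        toy_transform(block, key);             // work on the aligned copy
        std::memcpy(&s[i], block, BLOCK_SIZE); // copy the result back
    }
    return s;
}
```

Because the cipher only ever touches the local block array, the alignment of the string's own buffer no longer matters, and compilers usually turn the small fixed-size memcpy calls into plain loads and stores.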

You pass a constant const char* to the function but cast it to a non-constant uint32_t*. I guess that xtea::Encrypt modifies the string buffer in place.
In the first version const char encString[] = "Hello, World!" the variable --while being const-- most likely lies on the stack which is modifiable. So it's "not nice" to remove the const, but it works.
In the second version your string most likely lies in a read-only data segment. So casting away const lets you call the Encrypt function, but it crashes as soon as the function really tries to modify the string.

C++ string length in bytes

string str; str="hello"; str.length(); sizeof(str);
I see that str.length() returns the length in bytes. Why doesn't sizeof(str) return the same?
Is there an alternative in C++ to the C function strlen(str)? What is the equivalent in C++?
When I use Winsock, I have to pass the length in bytes to the send function. What should I use?
str.length()? Or sizeof(str)? Or something else? Because I see they produce different results.

sizeof returns the size of the data structure, not the size of the data it contains.
length() returns the length of the string that str contains, and is the function you want.
It might seem confusing because sizeof(char[30]) is 30, but that is because the size of the data structure is 30, and it will remain 30 no matter what you put in it.
The string is actually an extremely complicated structure, but suppose it was a simple class with a pointer and a length
class string
{
char *data;
int length;
};
then sizeof(string) would return:
The size of a char * pointer, possibly but not necessarily 4
plus the size of an int, possibly but not necessarily 4
So you might get a value of 8. What the value of data or length is has no effect on the size of the structure.

sizeof() is not really meant to be used on a string class. The string class doesn't store ONLY the string data (if it did, there would be no difference between it and a C-style string); it has other stuff in it as well, which throws off sizeof(). To get the actual length of the characters in the string, use str.length().
Don't use the C strlen() on a C++ string object. Don't use sizeof() either. Use .length().

A std::string object typically holds a pointer to separately allocated string data, since a string may have varying length. What sizeof() returns is the size of the string object itself; in an implementation that stores only that pointer, it will probably be 4 on a 32-bit machine.

Operator sizeof() returns the size of a given type or object in bytes. The 'type version' is quite simple to understand, but with the 'object version' you need to remember one thing:
sizeof() looks only at the type definition and deduces the total size from the size and number of its members (in general, polymorphic and multiply inherited types may have additional 'hidden' members).
In other words, let's assume we have:
struct A
{
int* p1;
char* p2;
};
As you can probably suspect, sizeof(A) will return 8 (as pointer is 4-byte type on most 32-bit systems). But, when you do something like this:
A a_1;
a_1.p1 = new int[64];
sizeof(a_1) will still return 8. That's because memory allocated by new and pointed by A's member, does not 'belong' to this object.
And that is why sizeof(str) and str.length() give different results. std::string allocates memory for chars on the heap (dynamically, through its allocator), so it doesn't change the string object's size.
So, if you want to send a string via the network, the proper size is str.length() and the data pointer can be retrieved by calling str.c_str().
I didn't understand the part about a "strlen(str) equivalent". In C++ there is also a strlen() function, with the same prototype, working in exactly the same way. It simply requires const char*, so you cannot use it on a std::string directly (but you can do strlen(str.c_str()), as std::string's internal string is guaranteed to be null-terminated). For std::string use .length() as you already did.

Why do I get "double free or corruption"?

I am trying to serialize a struct, but the program crashed with:
*** glibc detected *** ./unserialization: double free or corruption (fasttop): 0x0000000000cf8010 ***
#include <iostream>
#include <cstdlib>
#include <cstring>
struct Dummy
{
std::string name;
double height;
};
template<typename T>
class Serialization
{
public:
static unsigned char* toArray (T & t)
{
unsigned char *buffer = new unsigned char [ sizeof (T) ];
memcpy ( buffer , &t , sizeof (T) );
return buffer;
};
static T fromArray ( unsigned char *buffer )
{
T t;
memcpy ( &t , buffer , sizeof (T) );
return t;
};
};
int main ( int argc , char **argv )
{
Dummy human;
human.name = "Someone";
human.height = 11.333;
unsigned char *buffer = Serialization<Dummy>::toArray (human);
Dummy dummy = Serialization<Dummy>::fromArray (buffer);
std::cout << "Name:" << dummy.name << "\n" << "Height:" << dummy.height << std::endl;
delete buffer;
return 0;
}

I see two problems with this code:
You are invoking undefined behavior by memcpying a struct containing a std::string into another location. If you memcpy a class that isn't just a pure struct (for example, a std::string), it can cause all sorts of problems. In this particular case, I think that part of the problem might be that std::string sometimes stores an internal pointer to a buffer of characters containing the actual contents of the string. If you memcpy the std::string, you bypass the string's normal copy constructor that would duplicate the string. Instead, you now have two different instances of std::string sharing a pointer, so when they are destroyed they will both try to delete the character buffer, causing the bug you're seeing. There is no easy fix for this other than to not do what you're doing. It's just fundamentally unsafe.
You are allocating memory with new[], but deleting it with delete. You should use the array deleting operator delete[] to delete this memory, since using regular delete on it will result in undefined behavior, potentially causing this crash.
Hope this helps!

It's not valid to use memcpy() with a data element of type std::string (or really, any non-POD data type). The std::string class stores the actual string data in a dynamically-allocated buffer. When you memcpy() the contents of a std::string around, you are obliterating the pointers allocated internally and end up accessing memory that has been already freed.
You could make your code work by changing the declaration to:
struct Dummy
{
char name[100];
double height;
};
However, that has the disadvantages of a fixed size name buffer. If you want to maintain a dynamically sized name, then you will need to have a more sophisticated toArray and fromArray implementation that doesn't do straight memory copies.
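One possible shape for such a toArray/fromArray pair is sketched below. It is an assumption about the wire format, not a standard one: a 32-bit length, then the name's characters, then the double, all in the machine's native byte order (so it is only suitable for the same machine or architecture). Returning std::vector<unsigned char> also removes the new[]/delete mismatch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct Dummy
{
    std::string name;
    double height;
};

// Assumed layout: [ uint32 name length | name bytes | double height ]
std::vector<unsigned char> toArray(const Dummy& d)
{
    const std::uint32_t n = static_cast<std::uint32_t>(d.name.size());
    std::vector<unsigned char> buf(sizeof n + n + sizeof d.height);
    unsigned char* p = buf.data();
    std::memcpy(p, &n, sizeof n);               p += sizeof n;
    std::memcpy(p, d.name.data(), n);           p += n; // the characters, not the string object
    std::memcpy(p, &d.height, sizeof d.height);
    return buf;
}

Dummy fromArray(const std::vector<unsigned char>& buf)
{
    std::uint32_t n = 0;
    assert(buf.size() >= sizeof n);
    const unsigned char* p = buf.data();
    std::memcpy(&n, p, sizeof n);               p += sizeof n;
    assert(buf.size() == sizeof n + n + sizeof(double));

    Dummy d;
    // assign() copies the bytes into a buffer owned by the new string.
    d.name.assign(reinterpret_cast<const char*>(p), n);
    p += n;
    std::memcpy(&d.height, p, sizeof d.height);
    return d;
}
```

Only trivially copyable pieces (the length, the raw characters, the double) are ever passed to memcpy, so each std::string keeps exclusive ownership of its own buffer and nothing is freed twice.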

You're copying the string's internal buffer in the toArray call. When deserializing with fromArray you "create" a second string in dummy, which thinks it owns the same buffer as human.

std::string probably contains a pointer to a buffer that contains the string data. When you call toArray (human), you're memcpy()'ing the Dummy class's string, including the pointer to the string's data. Then when you create a new Dummy object by memcpy()'ing directly into it, you've created a new string object with the same pointer to string data as the first object. Next thing you know, dummy gets destructed and the copy of the pointer gets destroyed, then human gets destructed and BAM, you got a double free.
Generally, copying objects using memcpy like this will lead to all sorts of problems, like the one you've seen. It's probably just going to be the tip of the iceberg. Instead, you might consider explicitly implementing some sort of marshalling function for each class you want to serialize.
Alternatively, you might look into json libraries for c++, which can serialize things into a convenient text based format. JSON protocols are commonly used with custom network protocols where you want to serialize objects to send over a socket.

Caching a const char * as a return type

Was reading up a bit on my C++, and found this article about RTTI (Runtime Type Identification):
http://msdn.microsoft.com/en-us/library/70ky2y6k(VS.80).aspx . Well, that's another subject :) - However, I stumbled upon a weird saying in the type_info-class, namely about the ::name-method. It says: "The type_info::name member function returns a const char* to a null-terminated string representing the human-readable name of the type. The memory pointed to is cached and should never be directly deallocated."
How can you implement something like this yourself!? I've been struggling quite a bit with this exact problem often before, as I don't want to make a new char-array for the caller to delete, so I've stuck to std::string thus far.
So, for the sake of simplicity, let's say I want to make a method that returns "Hello World!", let's call it
const char *getHelloString() const;
Personally, I would make it somehow like this (Pseudo):
const char *getHelloString() const
{
char *returnVal = new char[13];
strcpy(returnVal, "HelloWorld!");
return returnVal;
}
.. But this would mean that the caller should do a delete[] on my return pointer :(
Thx in advance

How about this:
const char *getHelloString() const
{
return "HelloWorld!";
}
Returning a literal directly means the space for the string is allocated in static storage by the compiler and will be available throughout the duration of the program.

I like all the answers about how the string could be statically allocated, but that's not necessarily true for all implementations, particularly the one whose documentation the original poster linked to. In this case, it appears that the decorated type name is stored statically in order to save space, and the undecorated type name is computed on demand and cached in a linked list.
If you're curious about how the Visual C++ type_info::name() implementation allocates and caches its memory, it's not hard to find out. First, create a tiny test program:
#include <cstdio>
#include <typeinfo>
#include <vector>
int main(int argc, char* argv[]) {
std::vector<int> v;
const type_info& ti = typeid(v);
const char* n = ti.name();
printf("%s\n", n);
return 0;
}
Build it and run it under a debugger (I used WinDbg) and look at the pointer returned by type_info::name(). Does it point to a global structure? If so, WinDbg's ln command will tell the name of the closest symbol:
0:000> ?? n
char * 0x00000000`00857290
"class std::vector<int,class std::allocator<int> >"
0:000> ln 0x00000000`00857290
0:000>
ln didn't print anything, which indicates that the string wasn't in the range of addresses owned by any specific module. It would be in that range if it was in the data or read-only data segment. Let's see if it was allocated on the heap, by searching all heaps for the address returned by type_info::name():
0:000> !heap -x 0x00000000`00857290
Entry User Heap Segment Size PrevSize Unused Flags
-------------------------------------------------------------------------------------------------------------
0000000000857280 0000000000857290 0000000000850000 0000000000850000 70 40 3e busy extra fill
Yes, it was allocated on the heap. Putting a breakpoint at the start of malloc() and restarting the program confirms it.
Looking at the declaration in <typeinfo> gives a clue about where the heap pointers are getting cached:
struct __type_info_node {
void *memPtr;
__type_info_node* next;
};
extern __type_info_node __type_info_root_node;
...
_CRTIMP_PURE const char* __CLR_OR_THIS_CALL name(__type_info_node* __ptype_info_node = &__type_info_root_node) const;
If you find the address of __type_info_root_node and walk down the list in the debugger, you quickly find a node containing the same address that was returned by type_info::name(). The list seems to be related to the caching scheme.
The MSDN page linked in the original question seems to fill in the blanks: the name is stored in its decorated form to save space, and this form is accessible via type_info::raw_name(). When you call type_info::name() for the first time on a given type, it undecorates the name, stores it in a heap-allocated buffer, caches the buffer pointer, and returns it.
The linked list may also be used to deallocate the cached strings during program exit (however, I didn't verify whether that is the case). This would ensure that they don't show up as memory leaks when you run a memory debugging tool.

Well gee, if we are talking about just a function that you always want to return the same value, it's quite simple.
const char * foo()
{
    static char return_val[] = "HelloWorld!";
    return return_val;
}
The tricky bit is when you start caching the result: then you have to consider threading, what happens when your cache gets invalidated, and possibly storing things in thread-local storage. But if it's just a one-off output that is immediately copied, this should do the trick.
Alternatively, if you don't have a fixed size, you have to either use a static buffer of arbitrary size, which some result might eventually be too large for, or turn to a managed class, say std::string.
const char * foo()
{
static std::string output;
DoCalculation(output);
return output.c_str();
}
also the function signature
const char *getHelloString() const;
is only applicable for member functions.
At which point you don't need to deal with static function local variables and could just use a member variable.

I think that since they know that there are a finite number of these, they just keep them around forever. It might be appropriate for you to do that in some instances, but as a general rule, std::string is going to be better.
They can also look up new calls to see if they made that string already and return the same pointer. Again, depending on what you are doing, this may be useful for you too.
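A minimal sketch of that "compute once, hand out the same pointer" idea is shown below. make_display_name is a hypothetical stand-in for whatever expensive computation produces the string, and the cache is not thread-safe; it relies on the fact that std::map never moves its elements, so a c_str() pointer stays valid while the entry exists and is not modified.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for an expensive computation (e.g. undecorating a name).
std::string make_display_name(const std::string& raw)
{
    return "class " + raw;
}

// Returns a pointer that stays valid until the program exits; the caller
// must never free it. Not thread-safe as written.
const char* cached_name(const std::string& raw)
{
    static std::map<std::string, std::string> cache;

    std::map<std::string, std::string>::iterator it = cache.find(raw);
    if (it == cache.end()) // first request: compute and remember the result
        it = cache.insert(std::make_pair(raw, make_display_name(raw))).first;

    assert(it != cache.end());
    return it->second.c_str();
}
```

The cache owns every string, so it is released automatically when the static map is destroyed at program exit.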

Be careful when implementing a function that allocates a chunk of memory and then expects the caller to deallocate it, as you do in the OP:
const char *getHelloString() const
{
char *returnVal = new char[13];
strcpy(returnVal, "HelloWorld!");
return returnVal;
}
By doing this you are transferring ownership of the memory to the caller. If you call this code from some other function:
int main()
{
const char * str = getHelloString();
delete[] str;
return 0;
}
...the semantics of transferring ownership of the memory is not clear, creating a situation where bugs and memory leaks are more likely.
Also, at least under Windows, if the two functions are in 2 different modules you could potentially corrupt the heap. In particular, if main() is in hello.exe, compiled in VC9, and getHelloString() is in utility.dll, compiled in VC6, you'll corrupt the heap when you delete the memory. This is because VC6 and VC9 both use their own heap, and they aren't the same heap, so you are allocating from one heap and deallocating from another.

Why does the return type need to be const? Don't think of the method as a get method, think of it as a create method. I've seen plenty of APIs that require you to delete something a creation operator/method returns. Just make sure you note that in the documentation.
/* create a hello string
* must be deleted after use
*/
char *createHelloString() const
{
char *returnVal = new char[13];
strcpy(returnVal, "HelloWorld!");
return returnVal;
}

What I've often done when I need this sort of functionality is to have a char * pointer in the class - initialized to null - and allocate when required.
viz:
class CacheNameString
{
private:
char *name;
public:
CacheNameString():name(NULL) { }
const char *make_name(const char *v)
{
if (name != NULL)
free(name);
name = strdup(v);
return name;
}
};

Something like this would do:
const char *myfunction() {
static char *str = NULL; /* this only happens once */
delete [] str; /* delete previous cached version */
str = new char[strlen("whatever") + 1]; /* allocate space for the string and its NUL terminator */
strcpy(str, "whatever");
return str;
}
EDIT: Something that occurred to me is that a good replacement for this could be returning a boost::shared_ptr instead. That way the caller can hold onto it as long as they want and they don't have to worry about explicitly deleting it. A fair compromise IMO.

The advice given that warns about the lifetime of the returned string is sound advice. You should always be careful about recognising your responsibilities when it comes to managing the lifetime of returned pointers. The practice is quite safe, however, provided the variable pointed to will outlast the call to the function that returned it. Consider, for instance, the pointer to const char returned by c_str() as a method of class std::string. This returns a pointer to the memory managed by the string object, which is guaranteed to be valid as long as the string object is not deleted or made to reallocate its internal memory.
In the case of the std::type_info class, it is a part of the C++ standard, as its namespace implies. The pointer returned from name() actually points to static memory created by the compiler and linker when the class was compiled, and is a part of the run-time type identification (RTTI) system. Because it refers to a symbol in code space, you should not attempt to delete it.

I think something like this can only be implemented "cleanly" using objects and the RAII idiom.
When the object's destructor is called (obj goes out of scope), we can safely assume that the const char* pointers aren't being used anymore.
example code:
class ICanReturnConstChars
{
    std::vector<char*> cached_strings;
public:
    const char* yeahGiveItToMe(){
        char* newmem = new char[something];
        //write something to newmem
        cached_strings.push_back(newmem);
        return newmem;
    }
    ~ICanReturnConstChars(){
        while(!cached_strings.empty()){
            delete [] cached_strings.back();
            cached_strings.pop_back();
        }
    }
};
The only other possibility I know of is to pass a smart_ptr ..

It's probably done using a static buffer:
const char* GetHelloString()
{
static char buffer[256] = { 0 };
strcpy( buffer, "Hello World!" );
return buffer;
}
This buffer is like a global variable that is accessible only from this function.

You can't rely on GC; this is C++. That means you must keep the memory available until the program terminates. You simply don't know when it becomes safe to delete[] it. So, if you want to construct and return a const char*, simply new[] it and return it. Accept the unavoidable leak.
