std::string vs. byte buffer (difference in c++) - c++

I have a project where I transfer data between client and server using boost.asio sockets. Once one side of the connection receives data, it converts it into a std::vector of std::strings which gets then passed on to the actualy recipient object of the data via previously defined "callback" functions. That way works fine so far, only, I am at this point using methods like atoi() and to_string to convert other data types than strings into a sendable format and back. This method is of course a bit wasteful in terms of network usage (especially when transferring bigger amounts of data than just single ints and floats). Therefore I'd like to serialize and deserialize the data. Since, effectively, any serialisation method will produce a byte array or buffer, it would be convenient for me to just use std::string instead. Is there any disadvantage to doing that? I would not understand why there should be once, since strings should be nothing more than byte arrays.

In terms of functionality, there's no real difference.
Both for performance reasons and for code clarity reasons, however, I would recommend using std::vector<uint8_t> instead, as it makes it far more clear to anyone maintaining the code that it's a sequence of bytes, not a String.

You should use std::string when you work with strings, when you work with binary blob you better work with std::vector<uint8_t>. There many benefits:
your intention is clear so code is less error prone
you would not pass your binary buffer as a string to function that expects std::string by mistake
you can override std::ostream<<() for this type to print blob in proper format (usually hex dump). Very unlikely that you would want to print binary blob as a string.
there could be more. Only benefit of std::string that I can see that you do not need to do typedef.

You're right. Strings are nothing more than byte arrays. std::string is just a convenient way to manage the buffer array that represents the string. That's it!
There's no disadvantage of using std::string unless you are working on something REALLY REALLY performance critical, like a kernel, for example... then working with std::string would have a considerable overhead. Besides that, feel free to use it.
--
An std::string behind the scenes needs to do a bunch of checks about the state of the string in order to decide if it will use the small-string optimization or not. Today pretty much all compilers implement small-string optimizations. They all use different techniques, but basically it needs to test bitflags that will tell if the string will be constructed in the stack or the heap. This overhead doesn't exist if you straight use char[]. But again, unless you are working on something REALLY critical, like a kernel, you won't notice anything and std::string is much more convenient.
Again, this is just ONE of the things that happens under the hood, just as an example to show the difference of them.

Depending on how often you're firing network messages, std::string should be fine. It's a convenience class that handles a lot of char work for you. If you have a lot of data to push though, it might be worth using a char array straight and converting it to bytes, just to minimise the extra overhead std::string has.
Edit: if someone could comment and point out why you think my answer is bad, that'd be great and help me learn too.

Related

Google protocol buffers and use of std::string for arbitrary binary data

Related Question: vector <unsigned char> vs string for binary data.
My code uses vector<unsigned char> for arbitrary binary data. However, a lot of my code has to interface to Google's protocol buffers code. Protocol buffers uses std::string for arbitrary binary data. This makes for a lot of ugly allocate/copy/free cycles just to move data between Google protocol buffers and my code. It also makes for a lot of cases where I need two constructors (one which takes a vector and one a string) or two functions to convert a function to binary wire format.
The code deals with raw structures a lot internally because structures are content-addressable (stored and retrieved by hash), signed, and so on. So it's not just a matter of the interface to Google's protocol buffers. Objects are handled in raw forms in other parts of the code as well.
One thing I could do is just cut all my code over to use std::string for arbitrary binary data. Another thing I could do is try to work out more efficient ways to store and retrieve my vectors into Google protocol buffer objects. I guess my other choice would be to create standard, simple, but slow conversion functions to strings and always use them. This would avoid the rampant code duplication, but would be worst from a performance standpoint.
Any suggestions? Any better choices I'm missing?
This is what I'm trying to avoid:
if(SomeCase)
{
std::vector<unsigned char> rawObject(objectdata().size());
memcpy(&rawObject.front(), objectdata().data(), objectdata().size());
DoSometingWith(rawObject);
}
The allocate, copy, process, free is completely senseless when the raw data is already sitting there.
There are two ways to avoid copying that I know of and have seen in use.
The traditional way is indeed to pass a pointer/reference to a known entity. While this works fine and with a minimum of fuss, the issue is that it ties you up to a given representation, which entails conversions (as you experienced) when necessary.
The other way I discovered with LLVM:
ArrayRef
StringRef
The idea is amazingly simple: both hold a T* pointing to the start of an array of T and a size_t indicating the number of elements.
What is magical is that they completely hide the actual storage, be it a string, a vector, a dynamically or statically allocated C-array... it does not matter. The interface presented is completely uniform and no copy is involved.
The only caveat is that they do not take ownership of the memory (Ref!) so subtle bugs might creep in if you do not take care. Still, it is usually fine if you only use them in transient operations (within a function, for example) and do not store them for later use.
I have found them incredibly handy in buffer manipulations, especially thanks to the free slicing operations. Ranges are just so much easier to manipulate than pairs of iterators.
There is also a third way I have experienced, but never used in serious code up until now. The idea is that a vector<unsigned char> is a very low-level representation. By raising the abstraction layer and use, say, a Buffer class, you can completely encapsulate the exact way the memory is stored so that it becomes a non-issue, as far as your code is concerned.
And then, feel free to choose the one memory representation that requires the less conversion.
To avoid this code (which you present),
if(SomeCase)
{
std::vector<unsigned char> rawObject(objectdata().size());
memcpy(&rawObject.front(), objectdata().data(), objectdata().size());
DoSometingWith(rawObject);
}
where presumably objectData is a std::string, consider
typedef unsigned char Byte;
typedef std::vector<Byte> ByteVector;
and then e.g.
if( someCase )
{
auto const& s = objectData;
doSomethingWith( ByteVector( s.begin(), s.end() ) );
}

NULL terminated string and its length

I have a legacy code that receives some proprietary, parses it and creates a bunch of static char arrays (embedded in class representing the message), to represent NULL strings. Afterwards pointers to the string are passed all around and finally serialized to some buffer.
Profiling shows that str*() methods take a lot of time.
Therefore I would like to use memcpy() whether it's possible. To achive it I need a way to associate length with pointer to NULL terminating string. I though about:
Using std::string looks less efficient, since it requires memory allocation and thread synchronization.
I can use std::pair<pointer to string, length>. But in this case I need to maintain length "manually".
What do you think?
use std::string
Profiling shows that str*() methods
take a lot of time
Sure they do ... operating on any array takes a lot of time.
Therefore I would like to use memcpy()
whether it's possible. To achive it I
need a way to associate length with
pointer to NULL terminating string. I
though about:
memcpy is not really any slower than strcpy. In fact if you perform a strlen to identify how much you are going to memcpy then strcpy is almost certainly faster.
Using std::string looks less
efficient, since it requires memory
allocation and thread synchronization
It may look less efficient but there are a lot of better minds than yours or mine that have worked on it
I can use std::pair. But in this case I need to
maintain length "manually".
thats one way to save yourself time on the length calculation. Obviously you need to maintain the length manually. This is how windows BSTRs work, effectively (though the length is stored immediately prior, in memory, to the actual string data). std::string. for example, already does this ...
What do you think?
I think your question is asked terribly. There is no real question asked which makes answering next to impossible. I advise you actually ask specific questions in the future.
Use std::string. It's an advice already given, but let me explain why:
One, it uses a custom memory allocation scheme. Your char* strings are probably malloc'ed. That means they are worst-case aligned, which really isn't needed for a char[]. std::string doesn't suffer from needless alignment. Furthermore, common implementatios of std::string use the "Small String Optimization" which eliminates a heap allocation altogether, and improves locality of reference. The string size will be on the same cache line as the char[] itself.
Two, it keeps the string length, which is indeed a speed optimization. Most str* functions are slower because they don't have this information up front.
A second option would be a rope class, e.g. from SGI. This be more efficient by eliminating some string copies.
Your post doesn't explain where the str*() function calls are coming from; passing around char * certainly doesn't invoke them. Identify the sites that actually do the string manipulation and then try to find out if they're doing so inefficiently. One common pitfall is that strcat first needs to scan the destination string for the terminating 0 character. If you call strcat several times in a row, you can end up with a O(N^2) algorithm, so be careful about this.
Replacing strcpy by memcpy doesn't make any significant difference; strcpy doesn't do an extra pass to find the length of the string, it's simply (conceptually!) a character-by-character copy that stops when it encounters the terminating 0. This is not much more expensive than memcpy, and always cheaper than strlen followed by memcpy.
The way to gain performance on string operations is to avoid copies where possible; don't worry about making the copying faster, instead try to copy less! And this holds for all string (and array) implementations, whether it be char *, std::string, std::vector<char>, or some custom string / array class.
What do I think? I think that you should do what everyone else obsessed with pre-optimization does. You should find the most obscure, unmaintainable, yet intuitively (to you anyway) high-performance way you can and do it that way. Sounds like you're onto something with your pair<char*,len> with malloc/memcpy idea there.
Whatever you do, do NOT use pre-existing, optimized wheels that make maintenence easier. Being maintainable is simply the least important thing imaginable when you're obsessed with intuitively measured performance gains. Further, as you well know, you're quite a bit smarter than those who wrote your compiler and its standard library implementation. So much so that you'd be seriously silly to trust their judgment on anything; you should really consider rewriting the entire thing yourself because it would perform better.
And ... the very LAST thing you'll want to do is use a profiler to test your intuition. That would be too scientific and methodical, and we all know that science is a bunch of bunk that's never gotten us anything; we also know that personal intuition and revelation is never, ever wrong. Why waste the time measuring with an objective tool when you've already intuitively grasped the situation's seemingliness?
Keep in mind that I'm being 100% honest in my opinion here. I don't have a sarcastic bone in my body.

Is there a way to reduce ostringstream malloc/free's?

I am writing an embedded app. In some places, I use std::ostringstream a lot, since it is very convenient for my purposes. However, I just discovered that the performance hit is extreme since adding data to the stream results in a lot of calls to malloc and free. Is there any way to avoid it?
My first thought was making the ostringstream static and resetting it using ostringstream::set(""). However, this can't be done as I need the functions to be reentrant.
Well, Booger's solution would be to switch to sprintf(). It's unsafe, and error-prone, but it is often faster.
Not always though. We can't use it (or ostringstream) on my real-time job after initialization because both perform memory allocations and deallocations.
Our way around the problem is to jump through a lot of hoops to make sure that we perform all string conversions at startup (when we don't have to be real-time yet). I do think there was one situation where we wrote our own converter into a fixed-sized stack-allocated array. We have some constraints on size we can count on for the specific conversions in question.
For a more general solution, you may consider writing your own version of ostringstream that uses a fixed-sized buffer (with error-checking on the bounds being stayed within, of course). It would be a bit of work, but if you have a lot of those stream operations it might be worth it.
If you know how big the data is before creating the stream you could use ostrstream whose constructor can take a buffer as a parameter. Thus there will be no memory management of the data.
Probably the approved way of dealing with this would be to create your own basic_stringbuf object to use with your ostringstream. For that, you have a couple of choices. One would be to use a fixed-size buffer, and have overflow simply fail when/if you try to create output that's too long. Another possibility would be to use a vector as the buffer. Unlike std::string, vector guarantees that appending data will have amortized constant complexity. It also never releases data from the buffer unless you force it to, so it'll normally grow to the maximum size you're dealing with. From that point, it shouldn't allocate or free memory unless you create a string that's beyond the length it currently has available.
std::ostringsteam is a convenience interface. It links a std::string to a std::ostream by providing a custom std::streambuf. You can implement your own std::streambuf. That allows you to do the entire memory management. You still get the nice formatting of std::ostream, but you have full control over the memory management. Of course, the consequence is that you get your formatted output in a char[] - but that's probably no big problem if you're an embedded developer.

Why do you prefer char* instead of string, in C++?

I'm a C programmer trying to write c++ code. I heard string in C++ was better than char* in terms of security, performance, etc, however sometimes it seems that char* is a better choice. Someone suggested that programmers should not use char* in C++ because we could do all things that char* could do with string, and it's more secure and faster.
Did you ever used char* in C++? What are the specific conditions?
It's safer to use std::string because you don't need to worry about allocating / deallocating memory for the string. The C++ std::string class is likely to use a char* array internally. However, the class will manage the allocation, reallocation, and deallocation of the internal array for you. This removes all the usual risks that come with using raw pointers, such as memory leaks, buffer overflows, etc.
Additionally, it's also incredibly convenient. You can copy strings, append to a string, etc., without having to manually provide buffer space or use functions like strcpy/strcat. With std::string it's as simple as using the = or + operators.
Basically, it's:
std::string s1 = "Hello ";
std::string s2 = s1 + "World";
versus...
const char* s1 = "Hello";
char s2[1024]; // How much should I really even allocate here?
strcpy(s2, s1);
strcat(s2, " World ");
Edit:
In response to your edit regarding the use of char* in C++: Many C++ programmers will claim you should never use char* unless you're working with some API/legacy function that requires it, in which case you can use the std::string::c_str() function to convert an std::string to const char*.
However, I would say there are some legitimate uses of C-arrays in C++. For example, if performance is absolutely critical, a small C-array on the stack may be a better solution than std::string. You may also be writing a program where you need absolute control over memory allocation/deallocation, in which case you would use char*. Also, as was pointed out in the comments section, std::string isn't guaranteed to provide you with a contiguous, writable buffer *, so you can't directly write from a file into an std::string if you need your program to be completely portable. However, in the event you need to do this, std::vector would still probably be preferable to using a raw C-array.
* Although in C++11 this has changed so that std::string does provide you with a contiguous buffer
Ok, the question changed a lot since I first answered.
Native char arrays are a nightmare of memory management and buffer overruns compared to std::string. I always prefer to use std::string.
That said, char array may be a better choice in some circumstances due to performance constraints (although std::string may actually be faster in some cases -- measure first!) or prohibition of dynamic memory usage in an embedded environment, etc.
In general, std::string is a cleaner, safer way to go because it removes the burden of memory management from the programmer. The main reason it can be faster than char *'s, is that std::string stores the length of the string. So, you don't have to do the work of iterating through the entire character array looking for the terminating NULL character each time you want to do a copy, append, etc.
That being said, you will still find a lot of c++ programs that use a mix of std::string and char *, or have even written their own string classes from scratch. In older compilers, std::string was a memory hog and not necessarily as fast as it could be. This has gotten better over time, but some high-performance applications (e.g., games and servers) can still benefit from hand-tuned string manipulations and memory-management.
I would recommend starting out with std::string, or possibly creating a wrapper for it with more utility functions (e.g., starts_with(), split(), format(), etc.). If you find when benchmarking your code that string manipulation is a bottleneck, or uses too much memory, you can then decide if you want to accept the extra risks and testing that a custom string library demands.
TIP: One way of getting around the memory issues and still use std::string is to use an embedded database such as SQLite. This is particularly useful when generating and manipulating extremely large lists of strings, and performance is better than what you might expect.
C char * strings cannot contain '\0' characters. C++ string can handle null characters without a problem. If users enter strings containing \0 and you use C strings, your code may fail. There are also security issues associated with this.
Implementations of std::string hide the memory usage from you. If you're writing performance-critical code, or you actually have to worry about memory fragmentation, then using char* can save you a lot of headaches.
For anything else though, the fact that std::string hides all of this from you makes it so much more usable.
String may actually be better in terms of performance. And innumerable other reasons - security, memory management, convenient string functions, make std::string an infinitely better choice.
Edit: To see why string might be more efficient, read Herb Sutter's books - he discusses a way to internally implement string to use Lazy Initialization combined with Referencing.
Use std::string for its incredible convenience - automatic memory handling and methods / operators. With some string manipulations, most implementations will have optimizations in place (such as delayed evaluation of several subsequent manipulations - saves memory copying).
If you need to rely on the specific char layout in memory for other optimizations, try std::vector<char> instead. If you have a non-empty vector vec, you can get a char* pointer using &vec[0] (the vector has to be nonempty).
Short answer, I don't. The exception is when I'm using third party libraries that require them. In those cases I try to stick to std::string::c_str().
In all my professional career I've had an opportunity to use std::string at only two projects. All others had their own string classes :)
Having said that, for new code I generally use std::string when I can, except for module boundaries (functions exported by dlls/shared libraries) where I tend to expose C interface and stay away from C++ types and issues with binary incompatibilities between compilers and std library implementations.
Compare and contrast the following C and C++ examples:
strlen(infinitelengthstring)
versus
string.length()
std::string is almost always preferred. Even for speed, it uses small array on the stack before dynamically allocating more for larger strings.
However, char* pointers are still needed in many situations for writing strings/data into a raw buffer (e.g. network I/O), which can't be done with std::string.
The only time I've recently used a C-style char string in a C++ program was on a project that needed to make use of two C libraries that (of course) used C strings exclusively. Converting back and forth between the two string types made the code really convoluted.
I also had to do some manipulation on the strings that's actually kind of awkward to do with std::string, but I wouldn't have considered that a good reason to use C strings in the absence of the above constraint.

initializing std::string from char* without copy

I have a situation where I need to process large (many GB's) amounts of data as such:
build a large string by appending many smaller (C char*) strings
trim the string
convert the string into a C++ const std::string for processing (read only)
repeat
The data in each iteration are independent.
My question is, I'd like to minimise (if possible eliminate) heap allocated memory usage, as it at the moment is my largest performance problem.
Is there a way to convert a C string (char*) into a stl C++ string (std::string) without requiring std::string to internally alloc/copy the data?
Alternatively, could I use stringstreams or something similar to re-use a large buffer?
Edit: Thanks for the answers, for clarity, I think a revised question would be:
How can I build (via multiple appends) a stl C++ string efficiently. And if performing this action in a loop, where each loop is totally independant, how can I re-use thisallocated space.
You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.
A common approach to this sort of problem is to write the code which processes the data in step 3 to use a begin/end iterator pair; then it can easily process either a std::string, a vector of chars, a pair of raw pointers, etc. Unlike passing it a container type like std::string, it would no longer know or care how the memory was allocated, since it would still belong to the caller. Carrying this idea to its logical conclusion is boost::range, which adds all the overloaded constructors to still let the caller just pass a string/vector/list/any sort of container with .begin() and .end(), or separate iterators.
Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds, basically just an object with some standard typedefs, and operator ++/*/=/==/!= overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hit the end of the one it's working on, skipping over whitespace (I assume that's what you meant by trim). That you never had to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments/how large of fragments you have. This is essentially what the SGI rope mentioned by Martin York is: a string where append forms a linked list of fragments instead of a contiguous buffer, which is thus suitable for much longer values.
UPDATE (since I still see occasional upvotes on this answer):
C++17 introduces another choice: std::string_view, which replaced std::string in many function signatures, is a non-owning reference to a character data. It is implicitly convertible from std::string, but can also be explicitly constructed from contiguous data owned somewhere else, avoiding the unnecessary copying std::string imposes.
Is it at all possible to use a C++ string in step 1? If you use string::reserve(size_t), you can allocate a large enough buffer to prevent multiple heap allocations while appending the smaller strings, and then you can just use that same C++ string throughout all of the remaining steps.
See this link for more information on the reserve function.
To help with really big strings SGI has the class Rope in its STL.
Non standard but may be usefull.
http://www.sgi.com/tech/stl/Rope.html
Apparently rope is in the next version of the standard :-)
Note the developer joke. A rope is a big string. (Ha Ha) :-)
This is a lateral thinking answer, not directly addressing the question but "thinking" around it. Might be useful, might not...
Readonly processing of std::string doesn't really require a very complex subset of std::string's features. Is there a possibility that you could do search/replace on the code that performs all the processing on std::strings so it takes some other type instead? Start with a blank class:
class lightweight_string { };
Then replace all std::string references with lightweight_string. Perform a compilation to find out exactly what operations are needed on lightweight_string for it to act as a drop-in replacement. Then you can make your implementation work however you want.
Is each iteration independent enough that you can use the same std::string for each iteration? One would hope that your std::string implementation is smart enough to re-use memory if you assign a const char * to it when it was previously used for something else.
Assigning a char * into a std::string must always at least copy the data. Memory management is one of the main reasons to use std::string, so you won't be a able to override it.
In this case, might it be better to process the char* directly, instead of assigning it to a std::string.