char[] (c lang) to string (c++ lang) conversion - c++

I can see that almost all modern APIs are developed in the C language. There are reasons for that: processing speed, low level language, cross platform and so on.
Nowadays, I program in C++ because of its Object Orientation, the use of string, the STL but, mainly because it is a better C.
However, when my C++ programs need to interact with C APIs, I really get upset when I need to convert char[] types to C++ strings, then operate on these strings using their powerful methods, and finally convert from these strings to char[] again (because the API needs to receive char[]).
If I repeat these operations for millions of records the processing times are higher because of the conversion task.
For that simple reason, I feel that char[] is an obstacle to treating C++ as a better C.
I would like to know if you feel the same, if not (I hope so!) I really would like to know which is the best way for C++ to coexist with char[] types without doing those awful conversions.
Thanks for your attention.

The C++ string class has a lot of problems, and yes, what you're describing is one of them.
More specifically, there is no way to do string processing without creating a copy of the string, which may be expensive.
And because virtually all string processing algorithms are implemented as class members, they can only be used on the string class.
A solution you might want to experiment with is the combination of Boost.Range and Boost.StringAlgo.
Range allows you to create sequences out of a pair of iterators. They don't take ownership of the data, so they don't copy the string. They just point to the beginning and end of your char* string.
And Boost.StringAlgo implements all the common string operations as non-member functions, that can be applied to any sequence of characters. Such as, for example, a Boost range.
The combination of these two libraries pretty much solves the problem. They let you avoid having to copy your strings to process them.
Another solution might be to store your string data as std::strings all the time. When you need to pass a char* to some API function, simply pass it the address of the first character (&str[0]).
The problem with this second approach is that, before C++11, std::string didn't guarantee that its string buffer is contiguous and null-terminated, so on older compilers you either have to rely on implementation details, or manually add a null byte as part of the string.
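The second approach can be sketched as follows, assuming C++11 or later, where the string's storage is contiguous and null-terminated. `c_api_fill` is a hypothetical stand-in for a C function that writes into a caller-supplied buffer; it is not a real library call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical C-style API: writes a NUL-terminated greeting into buf
// (at most cap bytes, including the terminator). cap must be > 0.
extern "C" void c_api_fill(char* buf, std::size_t cap) {
    std::strncpy(buf, "hello from C", cap - 1);
    buf[cap - 1] = '\0';
}

// Let the C function write straight into the string's own storage.
std::string call_c_api() {
    std::string s(64, '\0');           // pre-size the buffer
    c_api_fill(&s[0], s.size());       // no intermediate char[] copy
    s.resize(std::strlen(s.c_str()));  // shrink to the text actually written
    return s;
}

// Usage sketch.
inline void example() {
    assert(call_c_api() == "hello from C");
}
```

The only extra work compared with a raw char[] is the final resize, which does not reallocate.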

If you use std::vector<char> instead of std::string, the underlying storage will be a C array that can be accessed with &someVec[0]. However, you do lose a lot of std::string conveniences such as operator+.
That said, I'd suggest just avoiding C APIs that mutate strings as much as possible. If you need to pass an immutable string to a C function, you can use c_str(), which is fast and non-copying on most std::string implementations.

I'm not sure what you mean by "conversion", but won't the following suffice for moving between char*, char[], and std::string?
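char charString[] = {'a', 'b', 'c', '\0'};           // C array, NUL-terminated
std::string standardString(charString);              // copies the characters
const char* stringPointer = standardString.c_str();  // no copy; valid until standardString is modified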

I don't think it's as bad as you make it out to be.
There is a cost converting a char[] to a std::string, but if you're going to be modifying the string, you have to pay that cost anyway whether converting to a std::string or copying to another char[] buffer.
The conversion going the other way (via string.c_str()) is usually trivial. It's usually returning a pointer to an internal buffer (just don't give that buffer to code that will modify it).

I'm not sure why you would be constrained to using C strings and still have an environment that runs C++ code, but if you really don't want the overhead of conversion, then don't convert. Just write routines that operate on the C strings.
Another reason for converting to C++ style strings is bounds safety.

"... because it is a better C."
Baloney. C++ is a vastly inferior dialect of C. The problems it solves are trivial, the problems it brings, much worse than those it solves.

Related

std::string vs. byte buffer (difference in c++)

I have a project where I transfer data between client and server using boost.asio sockets. Once one side of the connection receives data, it converts it into a std::vector of std::strings, which then gets passed on to the actual recipient object of the data via previously defined "callback" functions. That works fine so far, except that at this point I am using functions like atoi() and to_string to convert data types other than strings into a sendable format and back. This method is of course a bit wasteful in terms of network usage (especially when transferring bigger amounts of data than just single ints and floats). Therefore I'd like to serialize and deserialize the data. Since, effectively, any serialisation method will produce a byte array or buffer, it would be convenient for me to just use std::string instead. Is there any disadvantage to doing that? I don't see why there should be one, since strings should be nothing more than byte arrays.
In terms of functionality, there's no real difference.
Both for performance reasons and for code clarity reasons, however, I would recommend using std::vector<uint8_t> instead, as it makes it far more clear to anyone maintaining the code that it's a sequence of bytes, not a String.
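As a rough illustration of serializing into a byte buffer instead of text, here is a minimal sketch. The `ByteBuffer` alias, the helper names `put_u32` and `get_u32`, and the choice of big-endian byte order are all assumptions made for the example, not something the question specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using ByteBuffer = std::vector<std::uint8_t>;

// Append a 32-bit value in big-endian (network) byte order.
void put_u32(ByteBuffer& out, std::uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<std::uint8_t>((v >> shift) & 0xFF));
}

// Read a 32-bit big-endian value starting at offset.
// The caller must ensure that four bytes are available.
std::uint32_t get_u32(const ByteBuffer& in, std::size_t offset) {
    std::uint32_t v = 0;
    for (std::size_t i = 0; i < 4; ++i)
        v = (v << 8) | in[offset + i];
    return v;
}

// Usage sketch: four bytes on the wire instead of up to ten digits of text.
inline void example() {
    ByteBuffer buf;
    put_u32(buf, 4000000000u);
    assert(buf.size() == 4);
    assert(get_u32(buf, 0) == 4000000000u);
}
```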
You should use std::string when you work with strings; when you work with a binary blob, you are better off with std::vector<uint8_t>. There are many benefits:
your intention is clear, so the code is less error prone
you cannot pass your binary buffer by mistake to a function that expects std::string
you can overload operator<< on std::ostream for this type to print the blob in a proper format (usually a hex dump). It is very unlikely that you would want to print a binary blob as a string.
there could be more. The only benefit of std::string that I can see is that you do not need a typedef.
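The hex-dump point in the list above could look roughly like this. `Blob` is a hypothetical wrapper type introduced for the example so that the stream operator is attached to a distinct type; `to_hex` is likewise just a helper for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iomanip>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical wrapper so a blob cannot be mistaken for text.
struct Blob {
    std::vector<std::uint8_t> bytes;
};

// Print each byte as two lowercase hex digits separated by spaces.
std::ostream& operator<<(std::ostream& os, const Blob& b) {
    const std::ios_base::fmtflags saved_flags = os.flags();
    const char saved_fill = os.fill('0');
    os << std::hex;
    for (std::size_t i = 0; i < b.bytes.size(); ++i) {
        if (i != 0) os << ' ';
        os << std::setw(2) << static_cast<unsigned>(b.bytes[i]);
    }
    os.fill(saved_fill);      // restore the caller's stream state
    os.flags(saved_flags);
    return os;
}

// Convenience helper: render the dump into a std::string.
std::string to_hex(const Blob& b) {
    std::ostringstream oss;
    oss << b;
    return oss.str();
}

// Usage sketch.
inline void example() {
    Blob b{{0x00, 0xff, 0x1a}};
    assert(to_hex(b) == "00 ff 1a");
}
```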
You're right. Strings are nothing more than byte arrays. std::string is just a convenient way to manage the buffer array that represents the string. That's it!
There's no disadvantage of using std::string unless you are working on something REALLY REALLY performance critical, like a kernel, for example... then working with std::string would have a considerable overhead. Besides that, feel free to use it.
--
An std::string behind the scenes needs to do a bunch of checks about the state of the string in order to decide if it will use the small-string optimization or not. Today pretty much all compilers implement small-string optimizations. They all use different techniques, but basically it needs to test bitflags that will tell if the string will be constructed in the stack or the heap. This overhead doesn't exist if you straight use char[]. But again, unless you are working on something REALLY critical, like a kernel, you won't notice anything and std::string is much more convenient.
Again, this is just ONE of the things that happens under the hood, just as an example to show the difference of them.
Depending on how often you're firing network messages, std::string should be fine. It's a convenience class that handles a lot of char work for you. If you have a lot of data to push though, it might be worth using a char array straight and converting it to bytes, just to minimise the extra overhead std::string has.
Edit: if someone could comment and point out why you think my answer is bad, that'd be great and help me learn too.

Does C++17 std::basic_string_view invalidate the use of C strings?

C++17 is introducing std::basic_string_view, a non-owning string type that stores only a pointer to the first element of a string and the size of the string. Is there still a reason to keep using C strings?
Is there still a reason to keep using C strings?
I think it would be fair to say that other than speaking to a C API, there has never been a reason to use C strings.
When designing an interface of a function or method that simply needs a read-only representation of characters, you will want to prefer std::string_view. E.g. searching a string, producing an upper-case copy, printing it, and so on.
When designing an interface that takes a copy of a string of characters, you should probably prefer first and last iterators. However, std::string_view could be thought of as a proxy for these iterators, so string_view is appropriate.
If you want to take ownership of a long string, probably prefer to pass std::string, either by value or by r-value reference.
When designing an object that marshals calls to a c API that is expecting null-terminated strings, you should prefer std::string or std::string const& - because its c_str() method will correctly yield a null-terminated string.
When storing strings in objects (which are not temporary proxies), prefer std::string.
Of course the use of const char* as an owner of data in c++ is never appropriate. There is always a better way. This has been true since c++98.
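A minimal sketch of two of these guidelines, assuming a C++17 compiler. The function names `count_char` and `c_api_length` are made up for the illustration; `std::strlen` stands in for any C API that expects a null-terminated string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Read-only parameter: accepts std::string, string literals, and
// (pointer, length) pairs without copying the characters.
std::size_t count_char(std::string_view text, char c) {
    std::size_t n = 0;
    for (char ch : text)
        if (ch == c) ++n;
    return n;
}

// Marshalling to a C API that needs a NUL terminator: take std::string,
// whose c_str() is guaranteed to be terminated.
std::size_t c_api_length(const std::string& s) {
    return std::strlen(s.c_str());
}

// Usage sketch.
inline void example() {
    const char raw[] = "xxabcxx";
    assert(count_char("banana", 'a') == 3);                     // literal
    assert(count_char(std::string("banana"), 'n') == 2);        // std::string
    assert(count_char(std::string_view(raw + 2, 3), 'x') == 0); // slice of a buffer
    assert(c_api_length("abc") == 3);
}
```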
"Invalidate" has a technical meaning here which I think is unintentional. It sounds like "obviate" is the intended word.
You are still going to have to produce and consume C strings in order to interact with common APIs. For example, POSIX has open and execve, Win32 has the rough equivalents CreateFile and CreateProcess, and all of these functions operate on C strings. But in the end, you are still calling str.data() or str.c_str() in order to interact with these APIs, so that use of C strings is not going away, no matter whether str is a std::basic_string_view or std::basic_string.
You will still have to understand what C strings are in order to correctly use these APIs. While std::string guarantees a NUL terminator, std::string_view does not, and neither structure guarantees that there is no NUL byte somewhere inside the string. You will have to sanitize NUL bytes in the middle of your string in either case.
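Both caveats can be made concrete with a small sketch, assuming C++17. The helper names `has_embedded_nul` and `to_c_compatible` are hypothetical, and the copy in `to_c_compatible` is one possible way to obtain a terminator, not the only one.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <string_view>

// True if the view contains a NUL byte that a C API would treat as the end.
bool has_embedded_nul(std::string_view sv) {
    return sv.find('\0') != std::string_view::npos;
}

// A string_view may point into the middle of a larger buffer, so copy it
// into a std::string to obtain a terminated buffer for a C API.
std::string to_c_compatible(std::string_view sv) {
    return std::string(sv);
}

// Usage sketch.
inline void example() {
    std::string_view part("hello world", 5);       // views "hello" only
    // part.data() is not terminated after 5 chars; the copy is.
    assert(std::strlen(to_c_compatible(part).c_str()) == 5);
    assert(has_embedded_nul(std::string_view("a\0b", 3)));
    assert(!has_embedded_nul("abc"));
}
```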
This does not even touch on the wealth of 3rd party libraries which use C strings, or the cost of retrofitting your own code which uses C strings to one which uses std::string_view.

Is there a way to pass ownership of an existing char* in heap to a std::string? [duplicate]

I have a situation where I need to process large (many GB's) amounts of data as such:
1. build a large string by appending many smaller (C char*) strings
2. trim the string
3. convert the string into a C++ const std::string for processing (read only)
4. repeat
The data in each iteration are independent.
My question is, I'd like to minimise (if possible eliminate) heap allocated memory usage, as it at the moment is my largest performance problem.
Is there a way to convert a C string (char*) into a stl C++ string (std::string) without requiring std::string to internally alloc/copy the data?
Alternatively, could I use stringstreams or something similar to re-use a large buffer?
Edit: Thanks for the answers, for clarity, I think a revised question would be:
How can I build (via multiple appends) a stl C++ string efficiently? And if performing this action in a loop, where each iteration is totally independent, how can I re-use this allocated space?
You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.
A common approach to this sort of problem is to write the code which processes the data in step 3 to use a begin/end iterator pair; then it can easily process either a std::string, a vector of chars, a pair of raw pointers, etc. Unlike passing it a container type like std::string, it would no longer know or care how the memory was allocated, since it would still belong to the caller. Carrying this idea to its logical conclusion is boost::range, which adds all the overloaded constructors to still let the caller just pass a string/vector/list/any sort of container with .begin() and .end(), or separate iterators.
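The iterator-pair idea can be sketched like this. `count_words` is a hypothetical processing step chosen for the example; the point is only that the same template works on a std::string, a std::vector<char>, or a pair of raw pointers, and never owns the characters.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Counts whitespace-separated words in [first, last). The function never
// owns or copies the characters; the caller keeps the storage.
template <typename Iter>
std::size_t count_words(Iter first, Iter last) {
    std::size_t words = 0;
    bool in_word = false;
    for (; first != last; ++first) {
        const bool space =
            std::isspace(static_cast<unsigned char>(*first)) != 0;
        if (!space && !in_word) ++words;   // a new word starts here
        in_word = !space;
    }
    return words;
}

// Usage sketch: three different kinds of storage, one algorithm.
inline void example() {
    const char* raw = "one two  three";
    std::string s = "a b";
    std::vector<char> v{'x', ' ', 'y'};
    assert(count_words(raw, raw + std::strlen(raw)) == 3);
    assert(count_words(s.begin(), s.end()) == 2);
    assert(count_words(v.begin(), v.end()) == 2);
}
```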
Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds, basically just an object with some standard typedefs, and operator ++/*/=/==/!= overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hits the end of the one it's working on, skipping over whitespace (I assume that's what you meant by trim). That way you never have to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments/how large of fragments you have. This is essentially what the SGI rope mentioned by Martin York is: a string where append forms a linked list of fragments instead of a contiguous buffer, which is thus suitable for much longer values.
UPDATE (since I still see occasional upvotes on this answer):
C++17 introduces another choice: std::string_view, which has replaced std::string in many function signatures, is a non-owning reference to character data. It is implicitly convertible from std::string, but can also be explicitly constructed from contiguous data owned somewhere else, avoiding the unnecessary copying std::string imposes.
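For the trimming step in the question, a std::string_view can be narrowed without touching the underlying buffer. This is a minimal sketch assuming C++17; the function name `trim` and the set of characters treated as whitespace are choices made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns a view of `text` with leading and trailing spaces, tabs, and
// line breaks dropped. Nothing is copied; the result points into the
// caller's buffer and is only valid while that buffer lives.
std::string_view trim(std::string_view text) {
    const char* const ws = " \t\r\n";
    const std::size_t first = text.find_first_not_of(ws);
    if (first == std::string_view::npos) return std::string_view();
    const std::size_t last = text.find_last_not_of(ws);
    return text.substr(first, last - first + 1);
}

// Usage sketch: view a char buffer owned elsewhere, then trim it.
inline void example() {
    const char buf[] = "  data \n";
    std::string_view v = trim(std::string_view(buf, sizeof buf - 1));
    assert(v == "data");
    assert(v.data() == buf + 2);   // still points into buf, no copy made
}
```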
Is it at all possible to use a C++ string in step 1? If you use string::reserve(size_t), you can allocate a large enough buffer to prevent multiple heap allocations while appending the smaller strings, and then you can just use that same C++ string throughout all of the remaining steps.
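A sketch of that suggestion is below. `build_and_process` and its `process` callback are hypothetical names. Note the assumption that clear() keeps the allocated capacity: mainstream implementations behave that way, but the standard does not strictly promise it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds each record by appending fragments into one reused std::string.
// `process` is a hypothetical read-only consumer supplied by the caller.
template <typename Process>
void build_and_process(const std::vector<std::vector<const char*>>& batches,
                       std::size_t expected_size, Process process) {
    std::string buffer;
    buffer.reserve(expected_size);   // one up-front allocation
    for (const auto& fragments : batches) {
        buffer.clear();              // length -> 0; capacity is kept in practice
        for (const char* fragment : fragments)
            buffer.append(fragment);
        process(buffer);             // consumer sees a const std::string&
    }
}

// Usage sketch.
inline void example() {
    std::vector<std::vector<const char*>> batches{{"ab", "cd"}, {"e"}};
    std::vector<std::string> seen;
    build_and_process(batches, 16,
                      [&seen](const std::string& s) { seen.push_back(s); });
    assert(seen.size() == 2 && seen[0] == "abcd" && seen[1] == "e");
}
```

As long as `expected_size` covers the largest record, the appends in later iterations should not need to allocate.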
See this link for more information on the reserve function.
To help with really big strings SGI has the class Rope in its STL.
Non-standard, but may be useful.
http://www.sgi.com/tech/stl/Rope.html
Apparently rope is in the next version of the standard :-)
Note the developer joke. A rope is a big string. (Ha Ha) :-)
This is a lateral thinking answer, not directly addressing the question but "thinking" around it. Might be useful, might not...
Readonly processing of std::string doesn't really require a very complex subset of std::string's features. Is there a possibility that you could do search/replace on the code that performs all the processing on std::strings so it takes some other type instead? Start with a blank class:
class lightweight_string { };
Then replace all std::string references with lightweight_string. Perform a compilation to find out exactly what operations are needed on lightweight_string for it to act as a drop-in replacement. Then you can make your implementation work however you want.
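After a few rounds of compiler errors, the blank class might grow into something like the sketch below. Which members are actually needed depends on the processing code, so the set shown here is only a guess, and the non-owning pointer-plus-length layout is one possible implementation choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Non-owning stand-in for a read-only std::string: pointer plus length.
// The characters must outlive the lightweight_string that refers to them.
class lightweight_string {
public:
    lightweight_string(const char* data, std::size_t size)
        : data_(data), size_(size) {}
    explicit lightweight_string(const char* c_str)
        : data_(c_str), size_(std::strlen(c_str)) {}

    std::size_t size() const { return size_; }
    bool empty() const { return size_ == 0; }
    char operator[](std::size_t i) const { return data_[i]; }
    const char* begin() const { return data_; }
    const char* end() const { return data_ + size_; }

private:
    const char* data_;
    std::size_t size_;
};

// Usage sketch: wraps an existing char* without allocating or copying.
inline void example() {
    const char* raw = "hello";
    lightweight_string s(raw);
    assert(s.size() == 5 && s[1] == 'e');
    assert(s.begin() == raw);
}
```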
Is each iteration independent enough that you can use the same std::string for each iteration? One would hope that your std::string implementation is smart enough to re-use memory if you assign a const char * to it when it was previously used for something else.
Assigning a char * into a std::string must always at least copy the data. Memory management is one of the main reasons to use std::string, so you won't be able to override it.
In this case, might it be better to process the char* directly, instead of assigning it to a std::string?

Why do you prefer char* instead of string, in C++?

I'm a C programmer trying to write c++ code. I heard string in C++ was better than char* in terms of security, performance, etc, however sometimes it seems that char* is a better choice. Someone suggested that programmers should not use char* in C++ because we could do all things that char* could do with string, and it's more secure and faster.
Have you ever used char* in C++? What are the specific conditions?
It's safer to use std::string because you don't need to worry about allocating / deallocating memory for the string. The C++ std::string class is likely to use a char* array internally. However, the class will manage the allocation, reallocation, and deallocation of the internal array for you. This removes all the usual risks that come with using raw pointers, such as memory leaks, buffer overflows, etc.
Additionally, it's also incredibly convenient. You can copy strings, append to a string, etc., without having to manually provide buffer space or use functions like strcpy/strcat. With std::string it's as simple as using the = or + operators.
Basically, it's:
std::string s1 = "Hello ";
std::string s2 = s1 + "World";
versus...
const char* s1 = "Hello ";
char s2[1024]; // How much should I really even allocate here?
strcpy(s2, s1);
strcat(s2, "World");
Edit:
In response to your edit regarding the use of char* in C++: Many C++ programmers will claim you should never use char* unless you're working with some API/legacy function that requires it, in which case you can use the std::string::c_str() function to convert an std::string to const char*.
However, I would say there are some legitimate uses of C-arrays in C++. For example, if performance is absolutely critical, a small C-array on the stack may be a better solution than std::string. You may also be writing a program where you need absolute control over memory allocation/deallocation, in which case you would use char*. Also, as was pointed out in the comments section, std::string isn't guaranteed to provide you with a contiguous, writable buffer *, so you can't directly write from a file into an std::string if you need your program to be completely portable. However, in the event you need to do this, std::vector would still probably be preferable to using a raw C-array.
* Although in C++11 this has changed so that std::string does provide you with a contiguous buffer
Ok, the question changed a lot since I first answered.
Native char arrays are a nightmare of memory management and buffer overruns compared to std::string. I always prefer to use std::string.
That said, char array may be a better choice in some circumstances due to performance constraints (although std::string may actually be faster in some cases -- measure first!) or prohibition of dynamic memory usage in an embedded environment, etc.
In general, std::string is a cleaner, safer way to go because it removes the burden of memory management from the programmer. The main reason it can be faster than char *'s, is that std::string stores the length of the string. So, you don't have to do the work of iterating through the entire character array looking for the terminating NULL character each time you want to do a copy, append, etc.
That being said, you will still find a lot of c++ programs that use a mix of std::string and char *, or have even written their own string classes from scratch. In older compilers, std::string was a memory hog and not necessarily as fast as it could be. This has gotten better over time, but some high-performance applications (e.g., games and servers) can still benefit from hand-tuned string manipulations and memory-management.
I would recommend starting out with std::string, or possibly creating a wrapper for it with more utility functions (e.g., starts_with(), split(), format(), etc.). If you find when benchmarking your code that string manipulation is a bottleneck, or uses too much memory, you can then decide if you want to accept the extra risks and testing that a custom string library demands.
TIP: One way of getting around the memory issues and still use std::string is to use an embedded database such as SQLite. This is particularly useful when generating and manipulating extremely large lists of strings, and performance is better than what you might expect.
C char * strings cannot contain '\0' characters. C++ string can handle null characters without a problem. If users enter strings containing \0 and you use C strings, your code may fail. There are also security issues associated with this.
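The difference is easy to demonstrate. In this small sketch the helper names are made up for the illustration; `std::strlen` stands in for any C function that stops at the first terminator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Builds a 3-character string whose middle character is '\0'.
std::string make_with_nul() {
    return std::string("a\0b", 3);   // explicit length keeps the embedded NUL
}

// Length as std::string sees it (the stored size).
std::size_t cpp_length(const std::string& s) { return s.size(); }

// Length as a C API sees it (stops at the first '\0').
std::size_t c_visible_length(const std::string& s) {
    return std::strlen(s.c_str());
}

// Usage sketch: the C view silently loses everything after the NUL.
inline void example() {
    const std::string s = make_with_nul();
    assert(cpp_length(s) == 3);
    assert(c_visible_length(s) == 1);
}
```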
Implementations of std::string hide the memory usage from you. If you're writing performance-critical code, or you actually have to worry about memory fragmentation, then using char* can save you a lot of headaches.
For anything else though, the fact that std::string hides all of this from you makes it so much more usable.
String may actually be better in terms of performance. And innumerable other reasons - security, memory management, convenient string functions, make std::string an infinitely better choice.
Edit: To see why string might be more efficient, read Herb Sutter's books - he discusses a way to internally implement string to use Lazy Initialization combined with Referencing.
Use std::string for its incredible convenience - automatic memory handling and methods / operators. With some string manipulations, most implementations will have optimizations in place (such as delayed evaluation of several subsequent manipulations - saves memory copying).
If you need to rely on the specific char layout in memory for other optimizations, try std::vector<char> instead. If you have a non-empty vector vec, you can get a char* pointer using &vec[0] (the vector has to be nonempty).
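Here is a minimal sketch of using a std::vector<char> as the writable buffer for a C function. `format_id` and the fixed buffer size of 32 are assumptions for the example; `std::snprintf` is the real C library call being fed the raw pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Formats an integer through the C function std::snprintf, using a
// std::vector<char> as the writable buffer.
std::string format_id(int id) {
    std::vector<char> buf(32);   // non-empty, so &buf[0] is valid
    const int n = std::snprintf(&buf[0], buf.size(), "id-%d", id);
    if (n < 0) return std::string();   // encoding error
    // snprintf returns the length it wanted to write; clamp if truncated.
    const std::size_t wanted = static_cast<std::size_t>(n);
    const std::size_t len = wanted < buf.size() ? wanted : buf.size() - 1;
    return std::string(&buf[0], len);
}

// Usage sketch.
inline void example() {
    assert(format_id(42) == "id-42");
}
```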
Short answer, I don't. The exception is when I'm using third party libraries that require them. In those cases I try to stick to std::string::c_str().
In all my professional career I've had an opportunity to use std::string at only two projects. All others had their own string classes :)
Having said that, for new code I generally use std::string when I can, except for module boundaries (functions exported by dlls/shared libraries) where I tend to expose C interface and stay away from C++ types and issues with binary incompatibilities between compilers and std library implementations.
Compare and contrast the following C and C++ examples:
strlen(infinitelengthstring)
versus
string.length()
std::string is almost always preferred. Even for speed, most implementations keep short strings in a small buffer inside the string object itself (the small-string optimization) before dynamically allocating more for larger strings.
However, char* pointers are still needed in many situations for writing strings/data into a raw buffer (e.g. network I/O), which can't be done with std::string.
The only time I've recently used a C-style char string in a C++ program was on a project that needed to make use of two C libraries that (of course) used C strings exclusively. Converting back and forth between the two string types made the code really convoluted.
I also had to do some manipulation on the strings that's actually kind of awkward to do with std::string, but I wouldn't have considered that a good reason to use C strings in the absence of the above constraint.
