std::string as C++ byte array - c++

Google's Protocol Buffers use the C++ standard string class std::string as a variable-size byte array (see here), similar to Python, where the string class is also used as a byte array (at least until Python 3.0).
This approach seems to be good:
It allows fast assignment via assign() and fast direct access via data(), which vector<byte> does not allow.
It allows easier memory management and const references, unlike using byte*.
But I am curious: is that the preferred way to handle byte arrays in C++? What are the drawbacks of this approach (beyond a few static_casts)?

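Personally, I prefer std::vector, since std::string is not guaranteed to be stored contiguously and std::string::data() need not be O(1) (both points hold for C++03; C++11 and later require contiguous storage and a constant-time data()). std::vector will have a data() member function in C++0x.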

std::string may have a reference-counted implementation, which may be an advantage or a disadvantage for what you're writing, so always be careful about that. std::string may not be thread safe. The potential advantage of std::string is easy concatenation; however, this can also be easily achieved using the STL.
Also, all those problems in relation to protocols disappear when using boost::asio and its buffer objects.
As for drawbacks of std::vector:
Fast assignment can be done by a trick with std::swap.
Data can be accessed via &arr[0]; vectors are guaranteed (?) to be contiguous (at least all implementations implement them so).
Personally I use std::vector for variable sized arrays, and boost::array for static sized ones.
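A minimal sketch of the two vector tricks mentioned above, written in pre-C++11 style to match the discussion. The names ByteBuffer, make_buffer and take_over are hypothetical and used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

typedef std::vector<unsigned char> ByteBuffer;  // hypothetical alias

// Copy n raw bytes into a fresh buffer (one allocation, one copy).
ByteBuffer make_buffer(const void* src, std::size_t n) {
    ByteBuffer buf(n);
    if (n != 0) {
        // &buf[0] points at the contiguous storage; only valid when non-empty.
        std::memcpy(&buf[0], src, n);
    }
    return buf;
}

// "Fast assign": exchange the internal storage of the two vectors in O(1).
// Afterwards `to` holds what `from` held, and vice versa.
void take_over(ByteBuffer& to, ByteBuffer& from) {
    assert(&to != &from);
    to.swap(from);
}
```

Swapping with an empty vector leaves the source empty, so no bytes are copied when a buffer is handed from one owner to another.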

I "think" that using std::vector is a better approach, because it was intended to be used as an array. It is true that all implementations (that I know and have heard of) store string "elements" in contiguous memory, but that doesn't make it standard; i.e. code that uses std::string like a byte array assumes that the elements are contiguous when, according to the standard, they don't have to be.

Related

Best array type for a data member of an object for C++?

I've recently started learning C++ (having already a lot of experience with C).
I've briefly looked at vector<..> and array<..>.
I was wondering what is the best array type for a data member of an object for C++. Please keep in mind I want encapsulation, so this data member will be private - so I will need getter and setter functions for it.
I know the length of the array (the length will be kept constant, so no reallocation will be needed).
Would the traditional int array[100]; be the best?
Thanks in advance! :)
When you know the length of the array at compile time, you should probably go with array. You could go for vector too, but that might make somebody think that the size could potentially change (or is at least not determined at compile time). If you're using large arrays and the variable lives in local scope, you should consider using vector anyway.
Using int array[100]; could also be an alternative, it has some advantages and some disadvantage.
The advantage is that it might be slightly faster to set up (it would probably be faster than vector anyway) and you can initialize it in the classical way. Another is that some implementations allow a classic array whose length is decided at instantiation (I don't think it has made it into the standard, but it's rather easy to support), provided, of course, that you accept relying on the implementation supporting this extension.
The disadvantage is that you don't get easy full access to the STL methods (although std::begin and std::end still give you iterators for the array). Also, if the array is created as a local variable, its elements are stored on the stack, whereas a vector allocates its storage dynamically (std::array can likewise end up on the stack).
Since you know C, I'll give you an analogy in terms of that language.
std::vector is used like int* array = malloc(sizeof(int) * size) is used in C. If the array is big and you don't want the owning object to be big, then use std::vector. This is important if you want your object to be efficiently movable or swappable. If you consider std::vector, don't forget to evaluate std::deque as well.
A manually allocated dynamic array has no advantages over std::vector.
std::array is used like int array[100] is used in C. The lack of separate dynamic allocation makes creation of std::array fast. If you have many objects that contain small arrays, then std::array might be a good choice. If the size of the array is not constant or not known at compile time, then you cannot use std::array. In that case, use std::vector instead.
A regular C-style array does have one small advantage over std::array: when you initialize it with curly brackets, you may omit the size. With std::array, you must specify the size even if it seems redundant. This slightly nicer syntax does not outweigh the advantages of std::array, though. One significant advantage of std::array is that, unlike a regular C-style array, it can be passed as a parameter and returned by value.
So, in conclusion, the bestness of an array depends on your needs. In some cases std::array is better, and in others std::vector is. In some cases std::array is not an option at all. There's no need for the C-style alternatives.
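For the case in the question (fixed length known at compile time, private member, getter and setter), a sketch along these lines may be a reasonable starting point. The class name Samples and the member names are hypothetical; C++11 or later is assumed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

class Samples {  // hypothetical class name
public:
    static constexpr std::size_t kSize = 100;

    // Element getter and setter with a debug-time bounds check.
    int get(std::size_t i) const {
        assert(i < kSize);
        return values_[i];
    }
    void set(std::size_t i, int v) {
        assert(i < kSize);
        values_[i] = v;
    }

    // Read-only access to the whole array, e.g. for STL algorithms.
    const std::array<int, kSize>& values() const { return values_; }

private:
    std::array<int, kSize> values_{};  // value-initialized: all elements are 0
};
```

The array lives inside the object itself, so there is no separate heap allocation, and a Samples object can be copied or returned by value like any other value type.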

What is the correct way to deal with medium-sized byte arrays in modern C++?

Many languages and frameworks offer a "byte array" type, but the C++ standard library does not. What type is appropriate to use for medium-sized [1], resizable byte arrays, and how can I use that type efficiently (particularly: allocating, passing as parameters and destroying)?
[1]: By medium-sized I mean less than 100 MB.
You can use std::vector<unsigned char> or, as @Oli suggested, std::vector<uint8_t>. Yes, you can pass it around without copying the whole contents.
void f(std::vector<unsigned char> & byteArray) //pass by reference : no copy!
{
//
}
std::vector<unsigned char> byteArray;
//...
f(byteArray); //no copying is being made!
Many languages and frameworks offer a "byte array" type, but the C++ standard library does not.
You're wrong here; C++ has a byte array type: std::vector<unsigned char>, whose storage is guaranteed to be contiguous (there are other alternatives if you do not need this condition). You may want to read about references, move semantics, return value optimization and copy elision to know how to deal with those effectively.
Note: a byte, in C++ speak, is a char (either signed or unsigned). It may not be 8 bits long; you can get its size in bits via the CHAR_BIT macro from <climits>.
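To make the points about references, move semantics and CHAR_BIT concrete, here is a small sketch assuming C++11 or later. The names Bytes, make_pattern, count_zero_bytes and Packet are hypothetical.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <utility>
#include <vector>

// Assumption of this sketch: bytes are 8 bits wide on the target platform.
static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

typedef std::vector<unsigned char> Bytes;  // hypothetical alias

// Returning by value: the result is moved or elided, not deep-copied.
Bytes make_pattern(std::size_t n) {
    Bytes out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<unsigned char>(i & 0xFF);
    }
    return out;
}

// Read-only access: pass by const reference, no copy.
std::size_t count_zero_bytes(const Bytes& b) {
    std::size_t zeros = 0;
    for (std::size_t i = 0; i < b.size(); ++i) {
        if (b[i] == 0) ++zeros;
    }
    return zeros;
}

// Ownership transfer: take by value and move into the member.
class Packet {  // hypothetical class
public:
    explicit Packet(Bytes payload) : payload_(std::move(payload)) {}
    const Bytes& payload() const { return payload_; }
private:
    Bytes payload_;
};
```

A caller that no longer needs its buffer can write Packet p(std::move(buf)); the heap block is handed over and no bytes are copied.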
I would recommend using std::deque<uint8_t> instead of std::vector<uint8_t> since the latter requires contiguous chunks of memory. I would steer clear of allocating large blocks of memory with new since it will initialize the block of memory using the default constructor which might be a little more expensive than you want.
In a pinch, I believe that you can customize boost::shared_ptr with a custom deallocator so that you can allocate with std::malloc avoiding the initialization overhead and deallocate with std::free while still maintaining the goodness that shared_ptr brings to the table.
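A sketch of that idea, using std::shared_ptr (C++11) in place of boost::shared_ptr; that substitution is an assumption of this example, and make_shared_bytes is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <new>

// Allocate n uninitialized bytes with std::malloc and release them with
// std::free once the last shared_ptr copy goes away.
std::shared_ptr<std::uint8_t> make_shared_bytes(std::size_t n) {
    // malloc(0) may return a null pointer, so request at least one byte.
    void* raw = std::malloc(n == 0 ? 1 : n);
    if (raw == nullptr) {
        throw std::bad_alloc();
    }
    return std::shared_ptr<std::uint8_t>(
        static_cast<std::uint8_t*>(raw),
        [](std::uint8_t* p) { std::free(p); });  // custom deleter
}
```

Note that the pointer does not remember its length, so the size has to be carried alongside it.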
vector<char> should be fine for your purposes. If you want a shared version to avoid copying you can use the following:
typedef shared_ptr<vector<uint8_t>> ByteArray;
If you know the size at compile time, you can use array, which is slightly more space efficient.
Also, string can handle null characters, which may or may not make it more appropriate than vector.
Some extended implementations have a rope implementation http://en.wikipedia.org/wiki/Rope, http://www.aoc.nrao.edu/php/tjuerges/ALMA/STL/html-3.4.6/rope.html, that may be more appropriate.
There are performance reasons for using unique_ptr instead, at least for relatively large buffers. See https://stackoverflow.com/a/35798248/1992615 for details.

Is there a way to pass ownership of an existing char* in heap to a std::string? [duplicate]

I have a situation where I need to process large (many GB's) amounts of data as such:
build a large string by appending many smaller (C char*) strings
trim the string
convert the string into a C++ const std::string for processing (read only)
repeat
The data in each iteration are independent.
My question is, I'd like to minimise (if possible eliminate) heap allocated memory usage, as it at the moment is my largest performance problem.
Is there a way to convert a C string (char*) into a stl C++ string (std::string) without requiring std::string to internally alloc/copy the data?
Alternatively, could I use stringstreams or something similar to re-use a large buffer?
Edit: Thanks for the answers, for clarity, I think a revised question would be:
How can I build (via multiple appends) an STL C++ string efficiently? And if performing this action in a loop, where each iteration is totally independent, how can I re-use this allocated space?
You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.
A common approach to this sort of problem is to write the code which processes the data in step 3 to use a begin/end iterator pair; then it can easily process either a std::string, a vector of chars, a pair of raw pointers, etc. Unlike passing it a container type like std::string, it would no longer know or care how the memory was allocated, since it would still belong to the caller. Carrying this idea to its logical conclusion is boost::range, which adds all the overloaded constructors to still let the caller just pass a string/vector/list/any sort of container with .begin() and .end(), or separate iterators.
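As a rough illustration of the iterator-pair approach, the sketch below uses a hypothetical processing step, count_non_space, as a stand-in for whatever step 3 really does. The one-argument overload hints at the convenience that boost::range provides; it is not Boost's API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>   // only needed by callers that pass a std::string
#include <vector>   // only needed by callers that pass a std::vector

// The processing code sees only a [first, last) range; it neither knows
// nor cares who owns the memory or how it was allocated.
template <typename InputIt>
std::size_t count_non_space(InputIt first, InputIt last) {
    std::size_t n = 0;
    for (; first != last; ++first) {
        if (*first != ' ') ++n;
    }
    return n;
}

// Convenience overload for anything with begin() and end().
template <typename Container>
std::size_t count_non_space(const Container& c) {
    return count_non_space(c.begin(), c.end());
}
```

The same function then accepts s.begin(), s.end() for a std::string, a std::vector<char> passed directly, or a raw pair such as buf, buf + len.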
Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds, basically just an object with some standard typedefs, and operator ++/*/=/==/!= overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hits the end of the one it's working on, skipping over whitespace (I assume that's what you meant by trim). That way you would never have to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments you have and how large they are. This is essentially what the SGI rope mentioned by Martin York is: a string where append forms a linked list of fragments instead of a contiguous buffer, which is thus suitable for much longer values.
UPDATE (since I still see occasional upvotes on this answer):
C++17 introduces another choice: std::string_view, which has replaced std::string in many function signatures and is a non-owning reference to character data. It is implicitly convertible from std::string, but can also be explicitly constructed from contiguous data owned somewhere else, avoiding the unnecessary copying std::string imposes.
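A short sketch of how the trim step could be done on a view rather than a copy, assuming a C++17 compiler. The function name trim_spaces is hypothetical, and only the space character is treated as whitespace here.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Trim leading and trailing spaces without copying any characters:
// only the view's start pointer and length change.
std::string_view trim_spaces(std::string_view sv) {
    const std::size_t first = sv.find_first_not_of(' ');
    if (first == std::string_view::npos) {
        return std::string_view();  // empty or all spaces
    }
    const std::size_t last = sv.find_last_not_of(' ');
    return sv.substr(first, last - first + 1);
}
```

A view over an existing char buffer is built with std::string_view(ptr, len); the buffer must outlive every view that refers to it.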
Is it at all possible to use a C++ string in step 1? If you use string::reserve(size_t), you can allocate a large enough buffer to prevent multiple heap allocations while appending the smaller strings, and then you can just use that same C++ string throughout all of the remaining steps.
See this link for more information on the reserve function.
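A sketch of the reserve-and-reuse pattern, under the assumption (true of the common implementations, though not spelled out as a hard guarantee) that clear() keeps the string's capacity. The function name build_into is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Rebuild `out` from the given C-string fragments, reusing whatever
// capacity it already has from an earlier reserve() or iteration.
void build_into(std::string& out, const std::vector<const char*>& fragments) {
    out.clear();  // size becomes 0; the buffer is normally kept
    for (std::size_t i = 0; i < fragments.size(); ++i) {
        out.append(fragments[i]);  // no reallocation while capacity suffices
    }
}
```

Declaring the string once outside the loop and calling reserve() with a generous upper bound means that later iterations typically perform no heap allocation at all.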
To help with really big strings SGI has the class Rope in its STL.
Non-standard, but may be useful.
http://www.sgi.com/tech/stl/Rope.html
Apparently rope is in the next version of the standard :-)
Note the developer joke. A rope is a big string. (Ha Ha) :-)
This is a lateral thinking answer, not directly addressing the question but "thinking" around it. Might be useful, might not...
Readonly processing of std::string doesn't really require a very complex subset of std::string's features. Is there a possibility that you could do search/replace on the code that performs all the processing on std::strings so it takes some other type instead? Start with a blank class:
class lightweight_string { };
Then replace all std::string references with lightweight_string. Perform a compilation to find out exactly what operations are needed on lightweight_string for it to act as a drop-in replacement. Then you can make your implementation work however you want.
Is each iteration independent enough that you can use the same std::string for each iteration? One would hope that your std::string implementation is smart enough to re-use memory if you assign a const char * to it when it was previously used for something else.
Assigning a char * into a std::string must always at least copy the data. Memory management is one of the main reasons to use std::string, so you won't be able to override it.
In this case, might it be better to process the char* directly, instead of assigning it to a std::string?

C++ STL's String equivalent for Binary Data

I am writing a C++ application and I was wondering what the conventional C++ way of storing a byte array in memory is.
Is there something like a string, except specifically made for binary data?
Right now I am using an unsigned char* array to store the data, but something more STL/C++ like would be better.
I'd use std::vector<unsigned char>. Most operations you need can be done using the STL with iterator ranges. Also, remember that if you really need the raw data &v[0] is guaranteed to give a pointer to the underlying array.
You can use std::string also for binary data. The length of the data in std::string is stored explicitly and not determined by null-termination, so null-bytes don't have special meaning in a std::string.
std::string is often more convenient than std::vector<char> because it provides many methods that are useful to work with binary data but not provided by vector. To parse/create binary data it is useful to have things like substr(), overloads for + and std::stringstream at your disposal. On vectors the algorithms from <algorithm> can be used to achieve the same effects, but it's more clumsy than the string methods. If you just act on "sequences of characters", std::string gives you the methods you usually want, even if these sequences happen to contain "binary" data.
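A small sketch of std::string carrying binary data with embedded null bytes. The length-prefixed frame format and the names make_frame and frame_payload are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Build a frame: one length byte followed by the payload bytes.
std::string make_frame(const std::string& payload) {
    assert(payload.size() <= 255);  // the length must fit into one byte
    std::string frame(1, static_cast<char>(payload.size()));
    frame += payload;  // operator+= copies all bytes, including '\0'
    return frame;
}

// Extract the payload again using substr().
std::string frame_payload(const std::string& frame) {
    assert(!frame.empty());
    const std::size_t n = static_cast<unsigned char>(frame[0]);
    return frame.substr(1, n);
}
```

The one pitfall is construction: std::string(ptr) stops at the first null byte, so binary data has to be passed with an explicit length, as in std::string(ptr, len).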
You should use std::vector<unsigned char> or std::vector<uint8_t> (if you have a modern stdint.h header). There's nothing wrong with using unsigned char[] or uint8_t[] if you are working with fixed size buffers. Where std::vector really shines is when you need to grow or append to your buffers frequently. STL iterators have the same semantics as pointers, so STL algorithms will work equally well with std::vector and plain old arrays.
And as CAdaker pointed out, the expression &v[0] is guaranteed to give you the underlying pointer to the vector's buffer (and it's guaranteed to be one contiguous block of memory). This guarantee was added in an addendum to the C++ standard.
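The following sketch illustrates appending to a byte vector and running the same algorithm over a vector and a plain array. The helper names append_bytes and count_byte are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdint.h>
#include <vector>

// Append n bytes to the end of the buffer; the vector grows as needed.
void append_bytes(std::vector<uint8_t>& buf, const uint8_t* src, std::size_t n) {
    buf.insert(buf.end(), src, src + n);
}

// Works on any iterator range, so vector iterators and raw pointers
// into a plain array are handled by the same code.
template <typename It>
std::size_t count_byte(It first, It last, uint8_t value) {
    return static_cast<std::size_t>(std::count(first, last, value));
}
```

When a C API needs the raw pointer, &buf[0] (for a non-empty vector) can be passed together with buf.size().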
Personally, I'd avoid using std::string to manipulate arbitrary byte buffers, since I think it's potentially confusing, but it's not an unheard of practice.
There are multiple solutions, but the closest one (I feel) is std::vector<std::byte>, because it expresses the intent directly in code.
From : https://en.cppreference.com/w/cpp/types/byte
std::byte is a distinct type that implements the concept of byte as
specified in the C++ language definition.
Like char and unsigned char, it can be used to access raw memory
occupied by other objects (object representation), but unlike those
types, it is not a character type and is not an arithmetic type. A
byte is only a collection of bits, and the only operators defined for
it are the bitwise ones.
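A brief sketch of what working with std::vector<std::byte> looks like in practice, assuming C++17. The function names xor_in_place and sum_bytes are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bitwise operators are defined for std::byte, so XOR works directly.
void xor_in_place(std::vector<std::byte>& buf, std::byte key) {
    for (std::byte& b : buf) {
        b ^= key;
    }
}

// Arithmetic is not defined for std::byte; convert explicitly first.
unsigned sum_bytes(const std::vector<std::byte>& buf) {
    unsigned total = 0;
    for (std::byte b : buf) {
        total += std::to_integer<unsigned>(b);
    }
    return total;
}
```

The explicit std::to_integer conversion is the price of the stronger type: accidental arithmetic on raw bytes no longer compiles.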
How about std::basic_string<uint8_t>?
