C++ STL's String equivalent for Binary Data - c++

I am writing a C++ application and I was wondering what the conventional C++ way of storing a byte array in memory is.
Is there something like a string, except specifically made for binary data?
Right now I am using an unsigned char* array to store the data, but something more STL/C++-like would be better.

I'd use std::vector<unsigned char>. Most operations you need can be done using the STL with iterator ranges. Also, remember that if you really need the raw data &v[0] is guaranteed to give a pointer to the underlying array.
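A minimal sketch of what that looks like, assuming a hypothetical C-style function (count_zero_bytes, invented here for illustration) that only accepts a pointer and a length. Note that &v[0] must not be evaluated on an empty vector, so the wrapper checks for that first.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical C-style API that only understands a raw pointer plus a length.
std::size_t count_zero_bytes(const unsigned char* data, std::size_t size) {
    return static_cast<std::size_t>(
        std::count(data, data + size, static_cast<unsigned char>(0)));
}

// Build a small buffer that contains embedded zero bytes.
std::vector<unsigned char> make_sample_bytes() {
    std::vector<unsigned char> v;
    v.push_back(0xDE);
    v.push_back(0x00);
    v.push_back(0xAD);
    v.push_back(0x00);
    return v;
}

// Hand the vector's contiguous storage to the pointer-based API.
std::size_t zero_bytes_in(const std::vector<unsigned char>& v) {
    // &v[0] is only valid when the vector is non-empty.
    return v.empty() ? 0 : count_zero_bytes(&v[0], v.size());
}
```

From C++11 onwards v.data() can replace &v[0], and it is safe to call on an empty vector as well.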

You can also use std::string for binary data. The length of the data in a std::string is stored explicitly and not determined by null-termination, so null bytes have no special meaning in a std::string.
std::string is often more convenient than std::vector<char> because it provides many methods that are useful to work with binary data but not provided by vector. To parse/create binary data it is useful to have things like substr(), overloads for + and std::stringstream at your disposal. On vectors the algorithms from <algorithm> can be used to achieve the same effects, but it's more clumsy than the string methods. If you just act on "sequences of characters", std::string gives you the methods you usually want, even if these sequences happen to contain "binary" data.
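As a rough illustration of the point about substr() and +, here is a sketch with a made-up record layout (a 2-byte header followed by a payload; the layout and function names are assumptions for the example only). The important detail is the explicit length passed to the constructor: without it, construction from a const char* would stop at the first null byte.

```cpp
#include <cstddef>
#include <string>

// Build a record from a 2-byte header and a 3-byte payload.
// The explicit lengths keep the embedded '\0' bytes intact.
std::string make_record() {
    const char raw[] = {'\x01', '\x00', '\xFF', '\x00', '\x7F'};
    std::string header(raw, 2);
    std::string payload(raw + 2, 3);
    return header + payload;  // operator+ concatenates all bytes, nulls included
}

// Everything after the 2-byte header (assumes the record has at least 2 bytes).
std::string payload_of(const std::string& record) {
    return record.substr(2);
}
```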

You should use std::vector<unsigned char> or std::vector<uint8_t> (if you have a modern stdint.h header). There's nothing wrong with using unsigned char[] or uint8_t[] if you are working with fixed size buffers. Where std::vector really shines is when you need to grow or append to your buffers frequently. STL iterators have the same semantics as pointers, so STL algorithms will work equally well with std::vector and plain old arrays.
And as CAdaker pointed out, the expression &v[0] is guaranteed to give you the underlying pointer to the vector's buffer (and it's guaranteed to be one contiguous block of memory). This guarantee was added in an addendum to the C++ standard.
Personally, I'd avoid using std::string to manipulate arbitrary byte buffers, since I think it's potentially confusing, but it's not an unheard of practice.
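To illustrate the claim that algorithms work equally well on vectors and plain arrays, here is a small sketch. The helper name contains_byte is made up for this example; it simply forwards to std::find, which accepts raw pointers as iterators.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Works on any iterator range; raw pointers into an array qualify.
template <typename Iterator>
bool contains_byte(Iterator first, Iterator last, unsigned char value) {
    return std::find(first, last, value) != last;
}
```

A call such as contains_byte(fixed, fixed + 4, 3) on an unsigned char fixed[4] and contains_byte(v.begin(), v.end(), 3) on a std::vector<unsigned char> both go through the same template.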

There are multiple solutions, but the closest one (I feel) is std::vector<std::byte>, because it expresses the intent directly in code.
From: https://en.cppreference.com/w/cpp/types/byte
std::byte is a distinct type that implements the concept of byte as
specified in the C++ language definition.
Like char and unsigned char, it can be used to access raw memory
occupied by other objects (object representation), but unlike those
types, it is not a character type and is not an arithmetic type. A
byte is only a collection of bits, and the only operators defined for
it are the bitwise ones.
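A short sketch of what working with std::byte looks like in practice (requires C++17; the function names are invented for the example). Because std::byte is not an arithmetic type, masking uses the bitwise operators and converting back to a number has to be explicit via std::to_integer.

```cpp
#include <cstddef>
#include <vector>

// Clear the high nibble of every byte in place using bitwise operators only.
void keep_low_nibble(std::vector<std::byte>& bytes) {
    for (std::byte& b : bytes) {
        b &= std::byte{0x0F};
    }
}

// std::byte has no implicit conversion to int; std::to_integer makes it explicit.
int to_int(std::byte b) {
    return std::to_integer<int>(b);
}
```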

How about std::basic_string<uint8_t>?

Related

Does there exist a type similar to std::string (i.e., holding blobs of data of arbitrary length) that does not use the null terminator?

I use std::string to hold arbitrary blobs of binary data. It works this way, but is slightly inefficient as it needs to add the null terminator to the end of the blob. (C++11 spec is that c_str() and data() are the same, and return a pointer to a blob with a null terminator at the end.) Is there a better type to use, that still supports common operations (copy constructor, assignment operator, etc)?
Use std::vector. Documentation: https://en.cppreference.com/w/cpp/container/vector
Use the right tool for the job. A std::string is designed to store and manipulate sequences of char-like objects. This incorporates a mechanism (std::char_traits) to define text processing, which can be used to introduce more overhead than the terminating null character. As one example, the character traits could be used to sort strings case-insensitively. Blobs of binary data are not "text", so a string is not a great semantic fit for the job at hand.
The best fit for binary data is std::byte, but that was not introduced until C++17, and this question is tagged C++11. The pre-17 options are char and unsigned char (and std::uint8_t, if that type is provided and if you don't need to do type aliasing); I would recommend the latter since signed types add semantics not applicable to raw binary data.
Once you've picked the basic unit for your data, you need to collect several of these units into a container. When in doubt - when you have no particular reason to choose one container over another - use a std::vector, as that tends to perform well in many circumstances. (Not to mention that the functionality of a std::string is that of a std::vector<char> plus additional string functionality. You want to eliminate the overhead of the additional string functionality, right?)
std::vector<std::byte> // C++17
std::vector<unsigned char> // Earlier standards

Vector of double to a vector bytes

I have a std::vector<double> and need to work with a library that takes a const vector<uint8_t>. I specify what type the data is to the library with an enum.
Is there any way that I can avoid copying data completely and have the byte vector internally refer to the same data as the double vector? Since the byte vector is const and the double vector won't change during the lifetime of the byte vector, this appears like it would be pretty safe. There is a lot of data so copying it really isn't an option.
If your "byte vector" were actually a vector of bytes then you would have a chance, because you can legally examine pretty much anything as a char array. However, uint32_ts are not bytes and they are certainly not chars. So, no, you basically can't do this without horrible hacky magic whose safety will be entirely implementation dependent.
Either way, you can't do it with the vector types: you'd have to cast and pass the result of std::vector::data(), i.e. a pointer.
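For what it's worth, the cast mentioned above could look roughly like the sketch below. It only helps if the library can accept a pointer and a length instead of a vector, which is an assumption and not something the question states. ConstByteSpan and as_bytes are names made up for this example. Inspecting the doubles through an unsigned char pointer is allowed; nothing is copied.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical pointer-plus-length pair; does not own the memory.
struct ConstByteSpan {
    const unsigned char* data;
    std::size_t size;
};

// View the storage of a vector<double> as raw bytes without copying.
// The span is only valid while 'values' is alive and not reallocated.
ConstByteSpan as_bytes(const std::vector<double>& values) {
    ConstByteSpan span;
    span.data = reinterpret_cast<const unsigned char*>(values.data());
    span.size = values.size() * sizeof(double);
    return span;
}
```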
Sorry but I have to recommend revisiting your design. If the library you're using really takes a vector of integers that is actually supposed to be a vector of doubles, then its developers have put you in an awkward position.

Google protocol buffers and use of std::string for arbitrary binary data

Related Question: vector <unsigned char> vs string for binary data.
My code uses vector<unsigned char> for arbitrary binary data. However, a lot of my code has to interface to Google's protocol buffers code. Protocol buffers uses std::string for arbitrary binary data. This makes for a lot of ugly allocate/copy/free cycles just to move data between Google protocol buffers and my code. It also makes for a lot of cases where I need two constructors (one which takes a vector and one a string) or two functions to convert a function to binary wire format.
The code deals with raw structures a lot internally because structures are content-addressable (stored and retrieved by hash), signed, and so on. So it's not just a matter of the interface to Google's protocol buffers. Objects are handled in raw forms in other parts of the code as well.
One thing I could do is just cut all my code over to use std::string for arbitrary binary data. Another thing I could do is try to work out more efficient ways to store and retrieve my vectors into Google protocol buffer objects. I guess my other choice would be to create standard, simple, but slow conversion functions to strings and always use them. This would avoid the rampant code duplication, but would be worst from a performance standpoint.
Any suggestions? Any better choices I'm missing?
This is what I'm trying to avoid:
if(SomeCase)
{
    std::vector<unsigned char> rawObject(objectdata().size());
    memcpy(&rawObject.front(), objectdata().data(), objectdata().size());
    DoSometingWith(rawObject);
}
The allocate, copy, process, free is completely senseless when the raw data is already sitting there.
There are two ways to avoid copying that I know of and have seen in use.
The traditional way is indeed to pass a pointer/reference to a known entity. While this works fine and with a minimum of fuss, the issue is that it ties you up to a given representation, which entails conversions (as you experienced) when necessary.
The other way I discovered with LLVM:
ArrayRef
StringRef
The idea is amazingly simple: both hold a T* pointing to the start of an array of T and a size_t indicating the number of elements.
What is magical is that they completely hide the actual storage, be it a string, a vector, a dynamically or statically allocated C-array... it does not matter. The interface presented is completely uniform and no copy is involved.
The only caveat is that they do not take ownership of the memory (Ref!) so subtle bugs might creep in if you do not take care. Still, it is usually fine if you only use them in transient operations (within a function, for example) and do not store them for later use.
I have found them incredibly handy in buffer manipulations, especially thanks to the free slicing operations. Ranges are just so much easier to manipulate than pairs of iterators.
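To make the idea concrete, here is a minimal sketch of such a non-owning view. ByteRef is a made-up class for illustration; it is not LLVM's ArrayRef or StringRef API, it just follows the same pointer-plus-size idea and shows how one function can then serve both std::string and std::vector callers without copying.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical non-owning view over bytes (NOT LLVM's actual ArrayRef API).
class ByteRef {
public:
    ByteRef(const unsigned char* data, std::size_t size)
        : data_(data), size_(size) {}
    // Implicit conversions hide the underlying storage from the callee.
    ByteRef(const std::vector<unsigned char>& v)
        : data_(v.data()), size_(v.size()) {}
    ByteRef(const std::string& s)
        : data_(reinterpret_cast<const unsigned char*>(s.data())),
          size_(s.size()) {}

    const unsigned char* data() const { return data_; }
    std::size_t size() const { return size_; }
    unsigned char operator[](std::size_t i) const { return data_[i]; }

    // Slice without copying; offset and count are clamped to the view's bounds.
    ByteRef slice(std::size_t offset, std::size_t count) const {
        if (offset > size_) offset = size_;
        if (count > size_ - offset) count = size_ - offset;
        return ByteRef(data_ + offset, count);
    }

private:
    const unsigned char* data_;  // not owned
    std::size_t size_;
};

// One function, usable with either storage type.
unsigned sum_bytes(ByteRef bytes) {
    unsigned total = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i) total += bytes[i];
    return total;
}
```

As with the LLVM types, the view must not outlive the string or vector it points into.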
There is also a third way I have experienced, but never used in serious code up until now. The idea is that a vector<unsigned char> is a very low-level representation. By raising the abstraction layer and using, say, a Buffer class, you can completely encapsulate the exact way the memory is stored so that it becomes a non-issue, as far as your code is concerned.
And then, feel free to choose the one memory representation that requires the least conversion.
To avoid this code (which you present),
if(SomeCase)
{
    std::vector<unsigned char> rawObject(objectdata().size());
    memcpy(&rawObject.front(), objectdata().data(), objectdata().size());
    DoSometingWith(rawObject);
}
where presumably objectData is a std::string, consider
typedef unsigned char Byte;
typedef std::vector<Byte> ByteVector;
and then e.g.
if( someCase )
{
    auto const& s = objectData;
    doSomethingWith( ByteVector( s.begin(), s.end() ) );
}

What is the correct way to deal with medium-sized byte arrays in modern C++?

Many languages and frameworks offer a "byte array" type, but the C++ standard library does not. What type is appropriate to use for medium-sized [1], resizable byte arrays and how can I use that type efficiently? (particularly: allocating, passing as parameters and destroying)
[1]: By medium-sized I mean less than 100 MB.
You can use std::vector<unsigned char>, or as @Oli suggested std::vector<uint8_t>. Yes, you can pass it around without copying the whole contents.
void f(std::vector<unsigned char> & byteArray) //pass by reference : no copy!
{
//
}
std::vector<unsigned char> byteArray;
//...
f(byteArray); //no copying is being made!
Many languages and frameworks offer a "byte array" type, but the C++ standard library does not.
You're wrong here, C++ has a byte array type: std::vector<unsigned char>, whose storage is guaranteed to be contiguous (there are other alternatives if you do not need this condition). You may want to read about references, move semantics, return value optimization and copy elision to know how to deal with those effectively.
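A small sketch of those three techniques applied to a byte vector (C++11 or later). The ByteArray alias, make_zeroed, size_of and Packet are names invented for this example.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

typedef std::vector<unsigned char> ByteArray;

// Returned by value: the compiler elides the copy or moves the buffer,
// so the bytes themselves are not duplicated.
ByteArray make_zeroed(std::size_t size) {
    ByteArray bytes(size);  // elements are value-initialised to 0
    return bytes;
}

// Read-only access: pass by const reference, no copy.
std::size_t size_of(const ByteArray& bytes) {
    return bytes.size();
}

// Transfer of ownership: take by value and move into the member.
struct Packet {
    ByteArray payload;
    explicit Packet(ByteArray bytes) : payload(std::move(bytes)) {}
};
```

Calling Packet pkt(std::move(buffer)) hands the existing allocation to the packet; buffer is left in a valid but unspecified state afterwards.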
Note: a byte, in C++ speak, is a char (either signed or unsigned). It may not be 8 bits long; you can get its size in bits via the CHAR_BIT macro from <climits>.
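For reference, the macro is spelled CHAR_BIT and lives in <climits>. A minimal sketch of how code can query it, or refuse to compile on platforms where a byte is not 8 bits (the helper names are made up for the example):

```cpp
#include <climits>
#include <cstddef>

// Number of bits in one byte (one char) on this platform.
std::size_t bits_per_byte() {
    return CHAR_BIT;
}

// Number of bits occupied by an object of type T.
template <typename T>
std::size_t bits_in() {
    return sizeof(T) * CHAR_BIT;
}

// Code that assumes octets can state that assumption explicitly.
static_assert(CHAR_BIT >= 8, "the standard guarantees at least 8 bits per byte");
```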
I would recommend using std::deque<uint8_t> instead of std::vector<uint8_t> since the latter requires contiguous chunks of memory. I would steer clear of allocating large blocks of memory with new since it will initialize the block of memory using the default constructor which might be a little more expensive than you want.
In a pinch, I believe that you can customize boost::shared_ptr with a custom deallocator so that you can allocate with std::malloc avoiding the initialization overhead and deallocate with std::free while still maintaining the goodness that shared_ptr brings to the table.
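A sketch of that idea, using std::shared_ptr (C++11) in place of boost::shared_ptr; the custom-deleter constructor works the same way in both. The function name make_shared_buffer is made up for this example, and the buffer size has to be tracked separately by the caller.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>

// Allocate an uninitialised buffer with malloc and release it with free.
// Returns an empty pointer if the allocation fails.
std::shared_ptr<unsigned char> make_shared_buffer(std::size_t size) {
    void* raw = std::malloc(size);
    if (raw == nullptr) {
        return std::shared_ptr<unsigned char>();
    }
    // The lambda is the custom deleter; it runs when the last owner goes away.
    return std::shared_ptr<unsigned char>(
        static_cast<unsigned char*>(raw),
        [](unsigned char* p) { std::free(p); });
}
```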
vector<char> should be fine for your purposes. If you want a shared version to avoid copying you can use the following:
typedef shared_ptr<vector<uint8_t>> ByteArray;
If you know the size at compile time, you can use std::array, which is slightly more space efficient.
Also, std::string can handle null characters, which may or may not make it more appropriate than vector.
Some extended implementations have a rope implementation http://en.wikipedia.org/wiki/Rope, http://www.aoc.nrao.edu/php/tjuerges/ALMA/STL/html-3.4.6/rope.html, that may be more appropriate.
There are performance reasons for using unique_ptr instead, at least for relatively large buffers. See https://stackoverflow.com/a/35798248/1992615 for details.
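The linked answer is not reproduced here, so the following is only a sketch of the general shape of a unique_ptr-based buffer, under the assumption that single ownership is enough. OwnedBuffer and make_owned_buffer are names invented for the example. Unlike shared_ptr there is no reference count, and new unsigned char[size] leaves the bytes uninitialised instead of zeroing them.

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// Single-owner byte buffer; the size is stored next to the pointer.
struct OwnedBuffer {
    std::unique_ptr<unsigned char[]> data;
    std::size_t size;
};

// Allocate 'size' bytes; the contents are indeterminate until written.
OwnedBuffer make_owned_buffer(std::size_t size) {
    OwnedBuffer buffer;
    buffer.data.reset(new unsigned char[size]);
    buffer.size = size;
    return buffer;
}
```

Moving an OwnedBuffer transfers the pointer; copying it is a compile error, which makes accidental deep copies of large buffers impossible.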

std::string as C++ byte array

Google's Protocol buffer uses the C++ standard string class std::string as variable size byte array (see here) similar to Python where the string class is also used as byte array (at least until Python 3.0).
This approach seems to be good:
It allows fast assignment via assign and fast direct access via data that is not allowed with vector<byte>
It allows easier memory management and const references, unlike using byte*.
But I am curious: is that the preferred way to handle byte arrays in C++? What are the drawbacks of this approach (beyond a few static_casts)?
Personally, I prefer std::vector, since std::string is not guaranteed to be stored contiguously and std::string::data() need not be O(1). std::vector will have data member function in C++0x.
std::strings may have a reference-counted implementation, which may be an advantage or a disadvantage for what you're writing -- always be careful about that. std::string may not be thread safe. The potential advantage of std::string is easy concatenation; however, this can also be easily achieved using the STL.
Also, all those problems in relation to protocols disappear when using boost::asio and its buffer objects.
As for drawbacks of std::vector:
fast assign can be done by a trick with std::swap
data can be accessed via &arr[0] -- vectors are guaranteed (?) to be contiguous (at least all implementations implement them so)
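Both points in a small sketch (the function names are made up for the example). Swapping two vectors exchanges their internal pointers in constant time, so the "assignment" copies no bytes, and pointers to the elements stay valid but now belong to the other vector.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<unsigned char> ByteVector;

// "Fast assign": exchange the buffers in O(1) instead of copying them.
// Afterwards 'scratch' holds whatever 'destination' held before.
void assign_by_swap(ByteVector& destination, ByteVector& scratch) {
    destination.swap(scratch);
}

// Raw access to the contiguous storage; &v[0] must not be used when empty.
const unsigned char* raw_pointer(const ByteVector& v) {
    return v.empty() ? nullptr : &v[0];
}
```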
Personally I use std::vector for variable sized arrays, and boost::array for static sized ones.
I "think" that using std::vector is a better approach, because it was intended to be used as an array. It is true that all implementations (that I know and have heard of) store string "elements" in contiguous memory, but that doesn't make it standard. That is, code that uses std::string like a byte array assumes that the elements are contiguous where they don't have to be according to the standard.