std::string or std::vector<char> to hold raw data - c++

I hope this question is appropriate for stackoverflow... What is the difference between storing raw data bytes (8 bits) in a std::string and storing them in a std::vector<char>? I'm reading binary data from a file and storing those raw bytes in a std::string. This works well, there are no problems or issues with doing this. My program works as expected. However, other programmers prefer the std::vector<char> approach and suggest I stop using std::string as it's unsafe for raw bytes. So I'm wondering: why might it be unsafe to use std::string to hold raw data bytes? I know std::string is most often used to store ASCII text, but a byte is a byte, so I don't understand the preference for std::vector<char>.
Thanks for any advice!

The problem is not really whether it works or it doesn't. The problem is that it is utterly confusing for the next guy reading your code. std::string is meant for displaying text. Anybody reading your code will expect that. You'll declare your intent much better with a std::vector<char>.
It increases your WTF/min in code reviews.

In C++03, using std::string to store an array of byte data was not a good idea. By the standard, std::string did not have to store data contiguously. C++11 fixed that so that its data does have to be contiguous.
So doing this was not guaranteed to work in C++03, unless you had personally vetted your C++ standard library's implementation of std::string to ensure that it is contiguous.
Either way, I would suggest vector<char>. Generally, when you see string, you expect it to be a... string. You know, a sequence of characters in some form of encoding. A vector<char> makes it obvious that it isn't a string, but an array of bytes.
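For the case in the question (binary data read from a file), a vector<char> version could look roughly like the sketch below. The helper name read_file_bytes is made up for illustration, and error handling is left out on purpose.

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: load a whole file as raw bytes.
std::vector<char> read_file_bytes(const std::string& path) {
    // std::ios::binary stops the stream from translating line endings.
    std::ifstream in(path.c_str(), std::ios::binary);
    // istreambuf_iterator reads unformatted characters, so whitespace and
    // null bytes are kept as they are. A file that fails to open yields an
    // empty vector here; real code would want to report that.
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}
```

The return type alone tells the reader that the result is a byte buffer and not text.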

Besides contiguous storage and code-clarity issues, I ran into some fairly insidious errors trying to use std::string to hold raw bytes.
Most of them centered around trying to convert a char array of bytes to std::string when interfacing with C libraries. For example:
std::string password = "pass\0word";
std::cout << password.length() << std::endl; // prints 4, not 9
Maybe you can fix that by specifying the length:
std::string password("pass\0word", 0, 9);
std::cout << password.length() << std::endl; // nope! still 4!
This is probably because the constructor expects to receive a C-string, not a byte array. There might be a better way, but I ended up with this:
std::string password("pass0word", 0, 9);
password[4] = '\0';
std::cout << password.length() << std::endl; // hurray! 9!
A little clunky. Thankfully I found this in unit testing, but I would have missed it if my test vectors didn't have null bytes. What makes this insidious is that the second approach above will work fine until the array contains a null byte.
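For what it's worth, the "better way" is probably the two-argument (pointer, count) constructor. As far as I can tell, the three-argument form used above first converts the literal to a std::string (stopping at the null byte) and then takes a substring of that, which is why it still reports 4. A minimal sketch, with helper names invented for illustration:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copy exactly n bytes, embedded '\0' included.
std::string bytes_to_string(const char* data, std::size_t n) {
    // The (const char*, size_type) constructor is given the length and does
    // not stop at the first null byte, unlike the (const char*) constructor.
    return std::string(data, n);
}

// Same idea through the iterator-range constructor.
std::string bytes_to_string_range(const char* first, const char* last) {
    return std::string(first, last);
}
```

With char p[] = "pass\0word"; both bytes_to_string(p, 9) and bytes_to_string_range(p, p + 9) give a string of length 9 whose element 4 is '\0'.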
So far std::vector<uint8_t> looks like a good option (thanks J.N. and Hurkyl):
char p[] = "pass\0word";
std::vector<uint8_t> password(p, p+9); // :)
Note: I haven't tried the iterator constructor with std::string, but this error is easy enough to make that it might be worth avoiding even the possibility.
Lessons learned:
Test byte-handling methods with null byte-containing test vectors.
Be careful when (and I would say avoid) using std::string to hold raw bytes.

Related

What's the necessity of string in c++ while we already have char[]?

Many topics have discussed the difference between string and char[]. However, none of them makes it clear to me why string needed to be added to C++. Any insight is welcome, thanks!
char[] is C style. It is not object oriented, it forces you as the programmer to deal with implementation details (such as '\0' terminator) and rewrite standard code for handling strings every time over and over.
char[] is just an array of bytes, which can be used to store a string, but it is not a string in any meaningful way.
std::string is a class that properly represents a string and handles all string operations.
It lets you create objects and keep your code fully OOP (if that is what you want).
More importantly, it takes care of memory management for you.
Consider this simple piece of code:
// extract to string
#include <iostream>
#include <string>
int main ()
{
std::string name;
std::cout << "Please, enter your name: ";
std::cin >> name;
std::cout << "Hello, " << name << "!\n";
return 0;
}
How would you write the same thing using char[]?
Assume you can not know in advance how long the name would be!
Same goes for string concatenation and other operations.
With real string represented as std::string you combine two strings with a simple += operator. One line.
If you are using char[] however, you need to do the following:
Calculate the size of the combined string + terminator character.
Allocate memory for the new combined string.
Use strncpy to copy first string to new array.
Use strncat to append second string to first string in new array.
Plus, you need to remember not to use the unsafe strcpy and strcat and to free the memory once you are done with the new string.
std::string saves you all that hassle and the many bugs you can introduce while writing it.
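To make the comparison concrete, here is a rough sketch of both versions side by side. The function names are invented for this example, and the C version uses memcpy with the lengths it has already computed instead of strncpy/strncat.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

// C style: the caller owns the result and must release it with std::free().
char* concat_c(const char* a, const char* b) {
    const std::size_t la = std::strlen(a);
    const std::size_t lb = std::strlen(b);
    // Size of both strings plus the terminator, then allocate.
    char* out = static_cast<char*>(std::malloc(la + lb + 1));
    if (out == NULL) return NULL;
    // Copy the first string, then append the second one including its
    // terminating '\0'.
    std::memcpy(out, a, la);
    std::memcpy(out + la, b, lb + 1);
    return out;
}

// C++ style: allocation, copying and cleanup are handled by std::string.
std::string concat_cpp(const std::string& a, const std::string& b) {
    return a + b;
}
```

Every caller of concat_c has to check for NULL and remember to call free(); concat_cpp has neither obligation.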
As noted by MSalters in a comment, strings can grow. This is, in my opinion, the strongest reason to have them in C++.
For example, the following code has a bug which may cause it to crash, or worse, to appear to work correctly:
char message[] = "Hello";
strcat(message, "World");
The same idea with std::string behaves correctly:
std::string message{"Hello"};
message += "World";
Additional benefits of std::string:
You can send it to functions by value, while char[] can only be sent by reference; this point looks rather insignificant, but it enables powerful code like std::vector<std::string> (a list of strings which you can add to)
std::string stores its length, so any operation which needs the length is more efficient
std::string works similarly to all other C++ containers (vector, etc) so if you are already familiar with containers, std::string is easy to use
std::string has overloaded comparison operators, so it's easy to use with std::map, std::sort, etc.
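A small sketch of the container and comparison points above (the function names are only illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Takes the list by value and sorts the copy; std::sort uses
// std::string's operator< by default.
std::vector<std::string> sorted_copy(std::vector<std::string> names) {
    std::sort(names.begin(), names.end());
    return names;
}

// Counts occurrences; std::map orders its std::string keys with operator<.
std::map<std::string, int> count_words(const std::vector<std::string>& words) {
    std::map<std::string, int> counts;
    for (std::size_t i = 0; i < words.size(); ++i) {
        ++counts[words[i]];
    }
    return counts;
}
```

Neither function would work as written with char* elements, because comparing two char* values compares addresses and not the text they point to.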
The string class is essentially an improvement on the char[] variable.
With strings you can achieve the same goals as with a char[] variable, but you won't have to worry about the pitfalls of char[], like pointers and segmentation faults.
It is a more convenient way to build strings, but you don't really see the underlying mechanics of the language, like how concatenation or length functions are implemented.
Here is the documentation of the std::string class in C++ : C++ string documentation

Reading contents of file into dynamically allocated char* array- can I read into std::string instead?

I have found myself writing code which looks like this
// Treat the following as pseudocode - just an example
iofile.seekg(0, std::ios::end); // iofile is a file opened for read/write
uint64_t f_len = iofile.tellg();
if(f_len >= some_min_length)
{
// Focus on the following code here
char *buf = new char[7];
char buf2[]{"MYFILET"}; // just some random string
// if we see this it's a good indication
// the rest of the file will be in the
// expected format (unlikely to see this
// sequence in a "random file", but don't
// worry too much about this)
iofile.read(buf, 7);
if(memcmp(buf, buf2, 7) == 0) // I am confident this works
{
// carry on processing file ...
// ...
// ...
}
}
else
cout << "invalid file format" << endl;
This code is probably an okay sketch of what we might want to do when opening a file, which has some specified format (which I've dictated). We do some initial check to make sure the string "MYFILET" is at the start of the file - because I've decided all my files for the job I'm doing are going to start with this sequence of characters.
I think this code would be better if we didn't have to play around with "c-style" character arrays, but used strings everywhere instead. This would be advantageous because we could do things like if(buf == buf2) if buf and buf2 were std::strings.
A possible alternative could be,
// Focus on the following code here
std::string buf;
std::string buf2("MYFILET"); // very nice
buf.resize(7); // okay, but not great
iofile.read(buf.data(), 7); // pretty awful - error prone if wrong length argument given
// also we have to resize buf to 7 in the previous step
// lots of potential for mistakes here,
// and the length was used twice which is never good
if(buf == buf2) then do something
What are the problems with this?
We had to use the length variable 7 (or constant in this case) twice. Which is somewhere between "not ideal" and "potentially error prone".
We had to access the contents of buf using .data() which I shall assume here is implemented to return a raw pointer of some sort. I don't personally mind this too much, but others may prefer a more memory-safe solution, perhaps hinting we should use an iterator of some sort? I think in Visual Studio (for Windows users which I am not) then this may return an iterator anyway, which will give [?] warnings/errors [?] - not sure on this.
We had to have an additional resize statement for buf. It would be better if the size of buf could be automatically set somehow.
Before C++17, std::string::data() returns a const char*, and it is undefined behavior to write through that pointer. However, you are free to use std::vector::data() in this way.
If you want to use std::string, and dislike setting the size yourself, you may consider whether you can use std::getline(). This is the free function, not std::istream::getline(). The std::string version will read up to a specified delimiter, so if you have a text format you can tell it to read until '\0' or some other character which will never occur, and it will automatically resize the given string to hold the contents.
If your file is binary in nature, rather than text, I think most people would find std::vector<char> to be a more natural fit than std::string anyway.
We had to use the length variable 7 (or constant in this case) twice.
Which is somewhere between "not ideal" and "potentially error prone".
The second time you can use buf.size()
iofile.read(buf.data(), buf.size());
We had to access the contents of buf using .data() which I shall
assume here is implemented to return a raw pointer of some sort.
As pointed out by John Zwinck, .data() returns a pointer to const.
I suppose you could define buf as std::vector<char>; for vector (if I'm not wrong) .data() returns a pointer to char (in this case), not to const char.
size() and resize() are working in the same way.
We had to have an additional resize statement for buf. It would be
better if the size of buf could be automatically set somehow.
I don't think read() permits this.
p.s.: sorry for my bad English.
We can validate a signature without double buffering (rdbuf and a string) and allocating from the heap...
// terminating null not included
constexpr char sig[] = { 'M', 'Y', 'F', 'I', 'L', 'E', 'T' };
auto ok = all_of(begin(sig), end(sig), [&fs](char c) { return fs.get() == (int)c; });
if (ok) {}
template<class Src>
std::string read_string( Src& src, std::size_t count){
std::string buf;
buf.resize(count);
src.read(&buf.front(), count); // in C++17 make it buf.data()
return buf;
}
Now auto read = read_string( iofile, 7 ); is clean at point of use.
buf2 is a bad plan. I'd do:
if(read=="MYFILET")
directly, or use a const char myfile_magic[] = "MYFILET";.
I liked many of the ideas from the examples above, however I wasn't completely satisfied that there was an answer which would produce undefined-behaviour-free code for C++11 and C++17. I currently write most of my code in C++11 - because I don't anticipate using it on a machine in the future which doesn't have a C++11 compiler.
If one doesn't, then I add a new compiler or change machines.
However it does seem to me to be a bad idea to write code which I know may not work under C++17... That's just my personal opinion. I don't anticipate using this code again, but I don't want to create a potential problem for myself in the future.
Therefore I have come up with the following code. I hope other users will give feedback to help improve this. (For example there is no error checking yet.)
std::string
fstream_read_string(std::fstream& src, std::size_t n)
{
char *const buffer = new char[n + 1];
src.read(buffer, n);
buffer[n] = '\0';
std::string ret(buffer);
delete [] buffer;
return ret;
}
This seems like a basic, probably fool-proof method... It's a shame there seems to be no way to get std::string to use the same memory as allocated by the call to new.
Note we had to add an extra trailing null character in the C-style string, which is sliced off in the C++-style std::string.

Is it possible to use an std::string for read()?

Is it possible to use an std::string for read() ?
Example :
std::string data;
read(fd, data, 42);
Normally we have to use a char*, but is it possible to use a std::string directly? (I would prefer not to create a char* to store the result.)
Thanks
Well, you'll need to create a char* somehow, since that's what the
function requires. (BTW: you are talking about the Posix function
read, aren't you, and not std::istream::read?) The problem isn't
the char*, it's what the char* points to (which I suspect is what
you actually meant).
The simplest and usual solution here would be to use a local array:
char buffer[43];
int len = read(fd, buffer, 42);
if ( len < 0 ) {
// read error...
} else if ( len == 0 ) {
// eof...
} else {
std::string data(buffer, len);
}
If you want to capture directly into an std::string, however, this is
possible (although not necessarily a good idea):
std::string data;
data.resize( 42 );
int len = read( fd, &data[0], data.size() );
// error handling as above...
data.resize( len ); // If no error...
This avoids the copy, but quite frankly... The copy is insignificant
compared to the time necessary for the actual read and for the
allocation of the memory in the string. This also has the (probably
negligible) disadvantage of the resulting string having an actual buffer
of 42 bytes (rounded up to whatever), rather than just the minimum
necessary for the characters actually read.
(And since people sometimes raise the issue, with regards to the
contiguity of the memory in std::string: this was an issue ten or more
years ago. The original specifications for std::string were designed
expressly to allow non-contiguous implementations, along the lines of
the then popular rope class. In practice, no implementor found this
to be useful, and people did start assuming contiguity. At which point,
the standards committee decided to align the standard with existing
practice, and require contiguity. So... no implementation has ever not
been contiguous, and no future implementation will forego contiguity,
given the requirements in C++11.)
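Putting the second variant together with its error handling might look like the sketch below. It assumes a POSIX system, and the function name read_into_string is made up for this example.

```cpp
#include <cstddef>
#include <string>
#include <unistd.h>   // POSIX read()

// Hypothetical helper: read up to max_bytes from fd straight into out.
// Returns false on a read error. On success out holds the bytes read,
// possibly fewer than requested, or none at end of file.
bool read_into_string(int fd, std::string& out, std::size_t max_bytes) {
    out.resize(max_bytes);
    if (max_bytes == 0) return true;
    const ssize_t len = read(fd, &out[0], out.size());
    if (len < 0) {
        out.clear();
        return false;
    }
    out.resize(static_cast<std::size_t>(len));   // shrink to what arrived
    return true;
}
```

The final resize only changes the logical size, so the string keeps its larger capacity, as noted above.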
No, you cannot and you should not. Usually, std::string implementations internally store other information such as the size of the allocated memory and the length of the actual string. C++ documentation explicitly states that modifying values returned by c_str() or data() results in undefined behaviour.
If the read function requires a char *, then no. You could use the address of the first element of a std::vector of char as long as it's been resized first. I don't think old (pre C++11) strings are guaranteed to have contiguous memory; otherwise you could do something similar with the string.
No, but
std::string data;
cin >> data;
works just fine. If you really want the behaviour of read(2), then you need to allocate and manage your own buffer of chars.
Because read() is intended for raw data input, std::string is actually a bad choice, because std::string handles text. std::vector seems like the right choice to handle raw data.
Using std::getline from the strings library - see cplusplus.com - you can read from a stream and write directly into a string object. Example (again ripped from cplusplus.com - 1st hit on google for getline):
int main () {
string str;
cout << "Please enter full name: ";
getline (cin,str);
cout << "Thank you, " << str << ".\n";
}
So this will work when reading from stdin (cin) and from a file (ifstream).

(How) can I use the Boost String Algorithms Library with c strings (char pointers)?

Is it possible to somehow adapt a c-style string/buffer (char* or wchar_t*) to work with the Boost String Algorithms Library?
That is, for example, its trim algorithm has the following declaration:
template<typename SequenceT>
void trim(SequenceT &, const std::locale & = std::locale());
and the implementation (look for trim_left_if) requires that the sequence type has a member function erase.
How could I use that with a raw character pointer / c string buffer?
char* pStr = getSomeCString(); // example, could also be something like wchar_t buf[256];
...
boost::trim(pStr); // HOW?
Ideally, the algorithms would work directly on the supplied buffer, as far as possible. (It obviously can't work if an algorithm needs to allocate additional space in the "string".)
@Vitaly asks: why can't you create a std::string from char buffer and then use it in algorithms?
The reason I have char* at all is that I'd like to use a few algorithms on our existing codebase. Refactoring all the char buffers to string would be more work than it's worth, and when changing or adapting something it would be nice to just be able to apply a given algorithm to any c-style string that happens to live in the current code.
Using a string would mean to (a) copy char* to string, (b) apply algorithm to string and (c) copy string back into char buffer.
For the SequenceT-type operations, you probably have to use std::string. If you wanted to implement that by yourself, you'd have to fulfill many more requirements for creation, destruction, value semantics etc. You'd basically end up with your implementation of std::string.
The RangeT-type operations might be, however, usable on char*s using the iterator_range from Boost.Range library. I didn't try it, though.
There exists some code which implements a std::string-like string with a fixed buffer. With some tinkering you can modify this code to create a string type which uses an external buffer:
char buffer[100];
strcpy(buffer, " HELLO ");
xstr::xstring<xstr::fixed_char_buf<char> >
str(buffer, strlen(buffer), sizeof(buffer));
boost::algorithm::trim(str);
buffer[str.size()] = 0;
std::cout << buffer << std::endl; // prints "HELLO"
For this I added a constructor to xstr::xstring and xstr::fixed_char_buf that takes the buffer, the size of the buffer which is in use and the maximum size of the buffer. Further, I replaced the SIZE template argument with a member variable and changed the internal char array into a char pointer.
The xstr code is a bit old and will not compile without trouble on newer compilers, so it needs some minor changes. Further, I only added the things needed in this case. If you want to use this for real, you need to make some more changes to make sure it cannot use uninitialized memory.
Anyway, it might be a good start for writing your own string adapter.
I don't know what platform you're targeting, but on most modern computers (including mobile ones like ARM) memory copy is so fast you shouldn't even waste your time optimizing memory copies. I say - wrap char* in std::string and check whether the performance suits your needs. Don't waste time on premature optimization.
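A rough sketch of that wrap-and-copy-back approach follows. To keep it self-contained it does not use Boost: step (b) is a plain standard library trim of spaces and tabs, standing in for where a call such as boost::trim(s) would go. The function name trim_c_string is invented.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: trim spaces and tabs from a null-terminated buffer
// by round-tripping through std::string. The result is never longer than
// the input, so it always fits back into the same buffer.
void trim_c_string(char* buf) {
    const std::string s(buf);                         // (a) copy into a string
    const char* const blanks = " \t";
    const std::size_t first = s.find_first_not_of(blanks);
    std::string trimmed;                              // (b) apply the algorithm
    if (first != std::string::npos) {
        const std::size_t last = s.find_last_not_of(blanks);
        trimmed = s.substr(first, last - first + 1);
    }
    // (c) copy back, including the terminating '\0'
    std::memcpy(buf, trimmed.c_str(), trimmed.size() + 1);
}
```

This only stays safe for algorithms that never make the string longer; anything that can grow the text would also need the buffer capacity passed in and checked.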

How to mix std::string with Win32 functions that take char[] buffers?

There are a number of Win32 functions that take the address of a buffer, such as TCHAR[256], and write some data to that buffer. It may be less than the size of the buffer or it may be the entire buffer.
Often you'll call this in a loop, for example to read data off a stream or pipe. In the end I would like to efficiently return a string that has the complete data from all the iterated calls to retrieve this data. I had been thinking to use std::string since its += is optimized in a similar way to Java or C#'s StringBuffer.append()/StringBuilder.Append() methods, favoring speed instead of memory.
But I'm not sure how best to co-mingle the std::string with Win32 functions, since these functions take the char[] to begin with. Any suggestions?
If the argument is input-only use std::string like this
std::string text("Hello");
w32function(text.c_str());
If the argument is input/output use std::vector<char> instead like this:
std::string input("input");
std::vector<char> input_vec(input.begin(), input.end());
input_vec.push_back('\0');
w32function(&input_vec[0], input_vec.size());
// Now, if you want std::string again, just make one from that vector:
std::string output(&input_vec[0]);
If the argument is output-only also use std::vector<Type> like this:
// allocates _at least_ 1k and sets those to 0
std::vector<unsigned char> buffer(1024, 0);
w32function(&buffer[0], buffer.size());
// use 'buffer' vector now as you see fit
You can also use std::basic_string<TCHAR> and std::vector<TCHAR> if needed.
You can read more on the subject in the book Effective STL by Scott Meyers.
std::string has a function c_str() that returns its equivalent C-style string. (const char *)
Further, std::string has overloaded assignment operator that takes a C-style string as input.
e.g. Let ss be std::string instance and sc be a C-style string then the interconversion can be performed as :
ss = sc; // from C-style string to std::string
sc = ss.c_str(); // from std::string to C-style string
UPDATE :
As Mike Weller pointed out, If UNICODE macro is defined, then the strings will be wchar_t* and hence you would have to use std::wstring instead.
Rather than std::string, I would suggest to use std::vector, and use &v.front() while using v.size(). Make sure to have space already allocated!
You have to be careful with std::string and binary data.
s += buf;//will treat buf as a null terminated string
s += std::string(buf, size);//would work
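A slightly leaner spelling of the second line is std::string::append with an explicit count, sketched here with an invented helper name:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: append exactly `size` bytes of `buf` to `s`.
void append_bytes(std::string& s, const char* buf, std::size_t size) {
    // Same effect as s += std::string(buf, size), without the temporary.
    s.append(buf, size);
}
```

Starting from "ab", appending the 3 bytes of "c\0d" this way gives a string of size 5, while s += "c\0d" stops at the null byte and gives size 3.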
You need a compatible string type: typedef std::basic_string<TCHAR> tstring; is a good choice.
For input only arguments, you can use the .c_str() method.
For buffers, the choice is slightly less clear:
std::basic_string is not guaranteed to use contiguous storage like std::vector is. However, all std::basic_string implementations I've seen do use contiguous storage, and the C++ standards committee consider the missing guarantee to be a defect in the standard. The defect has been corrected in the C++0x draft.
If you're willing to bend the rules ever so slightly - with no negative consequences - you can use &(*aString.begin()) as a pointer to a TCHAR buffer of length aString.size(). Otherwise, you're stuck with std::vector for now.
Here's what the C++ standard committee have to say about contiguous string storage:
Not standardizing this existing practice does not give implementors more freedom. We thought it might a decade ago. But the vendors have spoken both with their implementations, and with their voice at the LWG meetings. The implementations are going to be contiguous no matter what the standard says. So the standard might as well give string clients more design choices.