Translating std::string to vector<char> - c++

I'm trying to convert a std::string to a char* (copying rather than casting) due to having to pass some data to a rather dated API.
On the face of it, there are a number of ways to do this, but it was suggested that I do this as a vector which seemed sensible. However, when I tried this the result was garbled. The code is like:
const string rawStr("My dog has no nose.");
vector<char> str(rawStr.begin(), rawStr.end());
cout << "\"" << (char*)(&str) << "\"" << endl;
(Note the unpleasant C cast - using static_cast does not work which is probably telling me something)
When I run this I get:
"P/"
Clearly not right. I took a look at the vector in gdb
(gdb) print str
$1 = std::vector of length 19, capacity 19 = {77 'M', 121 'y', 32 ' ', 100 'd', 111 'o',
103 'g', 32 ' ', 104 'h', 97 'a', 115 's', 32 ' ', 110 'n', 111 'o', 32 ' ', 110 'n',
111 'o', 115 's', 101 'e', 46 '.'}
Which looks correct although there's no null terminator at the end, which is concerning. The size of the vector (sizeof(str)) is 24 which suggests the characters are being stored as 8-bits.
Where am I going wrong?

The instance of std::vector is not itself an array of characters - it points to an array. Rather than (char*)(&str) try &str[0].
Judging from your gdb output you'll also want to push a zero onto the end of the vector before passing it to your legacy API.

First, the std::string does not contain the null termination as an element within the range covered by [begin(), end()). Second, the address of the vector is not the address of the first element of the vector's data. For this you need &str[0] or str.data():
#include <vector>
#include <string>
#include <iostream>
int main()
{
const std::string rawStr("My dog has no nose.");
std::vector<char> str(rawStr.begin(), rawStr.end());
str.push_back('\0');
std::cout << "\"" << &str[0] << "\"" << std::endl;
std::cout << "\"" << str.data() << "\"" << std::endl; // C++11
}

Two things you need to do:
1) take the address of the first character in the vector using &str[0]; This is absolutely fine (if a little contrived) since the standard guarantees the vector memory is contiguous. You can't simply write &str as that is the address of the vector which is not necessarily the address of the first data element.
2) inject a null terminator at the end of your vector if you want to display the characters as a string using the standard c-like functions. I might be wrong on this second point; does rawStr.end() point at an implicit null terminator associated with "My dog has no nose."?

The &str gets you a pointer to the vector object, not to the contained string of characters.
If you wish to print it as a C string, you'll need to push a 0 onto the end and then output &str[0] (which gives you the address of the beginning of the contained array).
This is very ugly, though. You are much better off either creating your own string vector class which inherits std::vector or using a function crafted to iterate through a vector, printing each element literally.
Edit:
If you are privy to C++11, for_each with a lambda could be used here in a clean way:
std::for_each(str.begin(), str.end(), [](char i) -> void {std::cout << i;});

Related

How to point to N (runtime defined number) bytes in the middle of something bigger?

The case is that I have a big set of binary data loaded in memory and need to perform bitwise operations between a N bytes length block of data and a N bytes chunk from the middle of this big set.
My first thought is to somehow get a pointer to N chars and then change it to point to start of the chunk. Pseudo code for what I intended to do:
#include <iostream>
int main()
{
unsigned int n = getBlockLength(), x = getChunkStartPos(), s = getBinarySetLength();
char *binary_set;
char (*chunk_ptr)[n]; //Let's suppose n = 5
//char block[n] = {'s', 'm', 't', 'h', 'g'};
binary_set = malloc(s);
fillBinarySet(binary_set, s); //Let's suppose s = 15 and binary_set is now filled with {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o'}
chunk_ptr = &binary_set + x;
std::cout << *chunk_ptr << std::endl; //should print "efghi"
//actually would do something like this:
//result = block & *chunk_ptr;
return 0;
}
This of course won't compile, and the array of characters is just an example; as I already said, it would actually be a big set of binary data. Even if it did compile, I'm not sure how the system would know chunk_ptr points to n bytes (it searches for a null character - which I can't put inside binary_set just for this purpose - when dealing with char, but what about an array of bools?)...
By "big" I mean something between 125MB and 1000MB. It will most likely be just memory mapped (mmap()) from a file instead of fully loaded into memory, but RAM consumption is expected to be almost the same, as the entire set will be subject to frequent read and write operations. The referred block (and chunk) is expected to be between 1B and 5MB in length.
Any thoughts on the fastest and least CPU- and RAM-intensive way to achieve this goal, please? This is the core of an already heavy application and the only working solutions (e.g.: looping over the binary data set byte by byte from the start to the end of the chunk) I could develop don't meet the performance requirements. :(
I just want some idea or direction, not asking you to do my job for me. If that is not possible with C or C++, I would like to be pointed to a solution in another high-performance, compiled language. Thanks so much.

C++ string and string literal comparison

So I am trying to simply do a std::string == "string-literal" which would work just fine, except that I am creating my string with
std::string str(strCreateFrom, 0, strCreateFrom.find(' '));
and find returns string::npos. Now both of these contain the string "submit", however == returns false. I have narrowed this down to the fact that the sizes are "different" even though they really aren't: str.size() is 7 and strlen("submit") is 6. Is this why == is failing? I assume it is, but I don't see why... shouldn't it check whether the last char of the difference is \0, as is the case in this situation?
And is there any way that I can get around this without having to use compare and specify the length to compare, or change my string?
Edit:
std::string instruction(unparsed, 0, unparsed.find(' '));
boost::algorithm::to_lower(instruction);
for(int i = 0; i < instruction.size(); i++){
std::cout << "create from " << (int) unparsed[i] << std::endl;
std::cout << "instruction " << (int) instruction[i] << std::endl;
std::cout << "literal " << (int) "submit"[i] << std::endl;
}
std::cout << (instruction == "submit") << std::endl;
prints
create from 83
instruction 115
literal 115
create from 117
instruction 117
literal 117
create from 98
instruction 98
literal 98
create from 77
instruction 109
literal 109
create from 105
instruction 105
literal 105
create from 116
instruction 116
literal 116
create from 0
instruction 0
literal 0
0
EDIT:
For more clarification as to why I'm confused I read the basic_string.h header and saw this:
/**
* @brief Compare to a C string.
* @param s C string to compare against.
* @return Integer < 0, 0, or > 0.
*
* Returns an integer < 0 if this string is ordered before @a s, 0 if
* their values are equivalent, or > 0 if this string is ordered after
* @a s. Determines the effective length rlen of the strings to
* compare as the smallest of size() and the length of a string
* constructed from @a s. The function then compares the two strings
* by calling traits::compare(data(),s,rlen). If the result of the
* comparison is nonzero returns it, otherwise the shorter one is
* ordered first.
*/
int
compare(const _CharT* __s) const;
Which is called from operator== so I am trying to find out why the size dif matters.
I didn't quite understand your question; more details may be needed, but you can use the C compare, which shouldn't have issues with null termination counting.
You could use:
bool same = (0 == strcmp(strLiteral, stdTypeString.c_str()));
strncmp can also be used to compare only a given number of chars in a char array.
Or try to fix the creation of the std::string.
Your unparsed std::string is already bad. It already contains the extra null in the string, so what you should look at is how it is being created.
Like I mentioned before mystring[mystring.size() -1] is the last character not the terminating null so if you see a '\0' there like you do in your output it means the null is treated like part of the string.
Try to trace back your parsed input and keep making sure that mystring[mystring.size() -1] is not '\0'.
To answer your size diff question:
The two strings are not the same: the literal is shorter and doesn't contain an extra null.
Memory of std::string->c_str() [S,u,b,m,i,t,\0,\0] length = 7, memory size = 8;
Memory of literal [S,u,b,m,i,t,\0] length = 6, memory size = 7;
Compare stops comparing when it reaches the terminating null in the literal, but it uses the stored size for the std::string, which is 7. Seeing that the literal ended at 6 but the std::string has size 7, it will say that the std::string is larger.
I think if you do the following it will return that the strings are the same, because it will create a std::string with an extra null on the right side as well (_countof is an MSVC macro; sizeof("submit") yields the same value, 7, here):
std::cout << (instruction == std::string("submit", _countof("submit"))) << std::endl;
PS: This is a common error made when taking a char* and making a std::string out of it. Frequently the array size itself is used as the length, but that includes the terminating zero, which std::string will add anyway. I believe that something like this is happening to your input somewhere, and if you add a -1 wherever that is, everything will work as expected.

Assign a fixed length character array to a string

I have a fixed length character array I want to assign to a string. The problem comes if the character array is full, the assign fails. I thought of using the assign where you can supply n however that ignores \0s. For example:
std::string str;
char test1[4] = {'T', 'e', 's', 't'};
str.assign(test1); // BAD "Test2" (or some random extra characters)
str.assign(test1, 4); // GOOD "Test"
size_t len = strlen(test1); // BAD 5
char test2[4] = {'T', 'e', '\0', 't'};
str.assign(test2); // GOOD "Te"
str.assign(test2, 4); // BAD "Tet"
size_t len = strlen(test2); // GOOD 2
How can I assign a fixed length character array to a string correctly for both cases?
Use the "pair of iterators" form of assign.
str.assign(test1, std::find(test1, test1 + 4, '\0'));
Character buffers in C++ are either-or: either they are null terminated or they are not (and fixed-length). Mixing them in the way you do is thus not recommended. If you absolutely need this, there seems to be no alternative to manual copying until either the maximum length or a null terminator is reached.
for (char const* i = test1; i != test1 + length and *i != '\0'; ++i)
str += *i;
You want both NULL termination and fixed length? This is highly unusual and not recommended. You'll have to write your own function and push_back each individual character.
For the first case, when you do str.assign(test1) and str.assign(test2), you have to have \0 in your array; otherwise it is not a C string and you can't assign it to std::string like this.
Saw your serialization comment -- use std::vector<char>, std::array<char,4>, or just a 4 char array or container.
Your second 'bad' example - the one which prints out "Tet" - actually does work, but you have to be careful about how you check it:
str.assign(test2, 4); // BAD "Tet"
cout << "\"" << str << "\"" << endl;
does copy exactly four characters. If you run it through octal dump(od) on Linux say, using my.exe | od -c you'd get:
0000000 " T e \0 t " \n
0000007

Why the difference in size when declaring a string in C++?

I should know this, but I don't, and I think it's probably a major gap in my foundation knowledge, so I thought I should ask the experts.
Given:
char s1[] = { 'M', 'y', 'W', 'o', 'r', 'd' };
char s2[] = "MyWord";
cout << strlen(s1)<<endl;
cout << strlen(s2)<<endl;
cout << sizeof(s1)<<endl;
cout << sizeof(s2)<<endl;
Why is the strlen 9 when declared as s1 but 6 when declared as s2? Where do the extra 3 come from? Is it the lack of a null terminating character?
And I understand that sizeof(s2) is 1 byte larger than sizeof(s1) because s2 will have the null character automatically added?
Please be gentle, TIA!
char s2[] = "MyWord"; Auto adds the null terminator because of the "" declaration.
s1 declaration does not. When you do a strlen on s1 and it comes out to 9 it is because it eventually ran into a \0 and stopped. s1 shouldn't be used with strlen since it is not null terminated. strlen on s1 could have been 1,000. If you turn on memory violation detection strlen of s1 would crash.
The first one lacks the implicit \0 terminator present in the second. Thus:
The first is 1 less than the second, memory-wise
Doing strlen on the first is undefined behavior (since it lacks the terminator)
Lack of terminating null character as you say.
When the strlen function is called on s1, it counts the characters until it finds a terminating '\0'. You may get different results depending on how your memory is initialized.
Your definition of s2 is actually equivalent to
char s2[] = { 'M', 'y', 'W', 'o', 'r', 'd', '\0' };
s1 is only 9 by happenstance; you can only use strlen on terminated strings. Declare it as char s1[] = { 'M', 'y', 'W', 'o', 'r', 'd', '\0' }; and see what happens.
strlen(s1) can return any value >= 6 (or crash), because it searches for the first '\0' char and you didn't provide one.
In your code,
s1 and s2 are arrays of char.
s1 last element is d, whereas s2 last element is \0.
That is, there is one more character in s2. It is a null-terminated string but s1 is not a null-terminated string.
Since s1 is not a null-terminated string, the expression strlen(s1) would invoke undefined behavior, as strlen will continue reading s1 beyond the last element.

C++ char array null terminator location

I am a student learning C++, and I am trying to understand how null-terminated character arrays work. Suppose I define a char array like so:
char* str1 = "hello world";
As expected, strlen(str1) is equal to 11, and it is null-terminated.
Where does C++ put the null terminator, if all 11 elements of the above char array are filled with the characters "hello world"? Is it actually allocating an array of length 12 instead of 11, with the 12th character being '\0'? CPlusPlus.com seems to suggest that one of the 11 would need to be '\0', unless it is indeed allocating 12.
Suppose I do the following:
// Create a new char array
char* str2 = (char*) malloc( strlen(str1) );
// Copy the first one to the second one
strncpy( str2, str1, strlen(str1) );
// Output the second one
cout << "Str2: " << str2 << endl;
This outputs Str2: hello worldatcomY╗°g♠↕, which I assume is C++ reading the memory at the location pointed to by the pointer char* str2 until it encounters what it interprets to be a null character.
However, if I then do this:
// Null-terminate the second one
str2[strlen(str1)] = '\0';
// Output the second one again
cout << "Terminated Str2: " << str2 << endl;
It outputs Terminated Str2: hello world as expected.
But doesn't writing to str2[11] imply that we are writing outside of the allocated memory space of str2, since str2[11] is the 12th byte, but we only allocated 11 bytes?
Running this code does not seem to cause any compiler warnings or run-time errors. Is this safe to do in practice? Would it be better to use malloc( strlen(str1) + 1 ) instead of malloc( strlen(str1) )?
In the case of a string literal the compiler is actually reserving an extra char element for the \0 element.
// Create a new char array
char* str2 = (char*) malloc( strlen(str1) );
This is a common mistake new C programmers make. When allocating the storage for a char* you need to allocate the number of characters + 1 more to store the \0. Not allocating the extra storage here means this line is also illegal
// Null-terminate the second one
str2[strlen(str1)] = '\0';
Here you're actually writing past the end of the memory you allocated. When allocating X elements the last legal byte you can access is the memory address offset by X - 1. Writing to the X element causes undefined behavior. It will often work but is a ticking time bomb.
The proper way to write this is as follows
size_t size = strlen(str1) + sizeof(char);
char* str2 = (char*) malloc(size);
strncpy( str2, str1, size);
// Output the second one
cout << "Str2: " << str2 << endl;
In this example an explicit str2[size - 1] = '\0' isn't actually needed. strncpy pads the destination with null characters when the source is shorter than the count. Here str1 holds only size - 1 characters, so the final element of str2 is filled with \0.
Is it actually allocating an array of length 12 instead of 11, with the 12th character being '\0'?
Yes.
But doesn't writing to str2[11] imply that we are writing outside of the allocated memory space of str2, since str2[11] is the 12th byte, but we only allocated 11 bytes?
Yes.
Would it be better to use malloc( strlen(str1) + 1 ) instead of malloc( strlen(str1) )?
Yes, because the second form is not long enough to copy the string into.
Running this code does not seem to cause any compiler warnings or run-time errors.
Detecting this in all but the simplest cases is a very difficult problem. So the compiler authors simply don't bother.
This sort of complexity is exactly why you should be using std::string rather than raw C-style strings if you are writing C++. It's as simple as this:
std::string str1 = "hello world";
std::string str2 = str1;
The literal "hello world" is a char array that looks like:
{ 'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '\0' }
So, yes, the literal is 12 chars in size.
Also, malloc( strlen(str1) ) is allocating memory for 1 less byte than is needed, since strlen returns the length of the string, not including the NUL terminator. Writing to str2[strlen(str1)] is writing 1 byte past the amount of memory that you've allocated.
Your compiler won't tell you that, but if you run your program through valgrind or a similar program available on your system it'll tell you if you're accessing memory you shouldn't be.
I think you are confused by the return value of strlen. It returns the length of the string, and it should not be confused with the size of the array that holds the string. Consider this example :
char* str = "Hello\0 world";
I added a null character in the middle of the string, which is perfectly valid. Here the array will have a length of 13 (12 characters + the final null character), but strlen(str) will return 5, because there are 5 characters before the first null character. strlen just counts the characters until a null character is found.
So if I use your code :
char* str1 = "Hello\0 world";
char* str2 = (char*) malloc(strlen(str1)); // strlen(str1) will return 5
strncpy(str2, str1, strlen(str1));
cout << "Str2: " << str2 << endl;
The str2 array will have a length of 5, and won't be terminated by a null character (because strlen doesn't count it). Is this what you expected?
For a standard C string the length of the array that is storing the string is always one character longer than the length of the string in characters. So your "hello world" string has a string length of 11 but requires a backing array with 12 entries.
The reason for this is simply the way those strings are read. The functions handling those strings basically read the characters of the string one by one until they find the termination character '\0' and stop at this point. If this character is missing, those functions just keep reading the memory until they either hit a protected memory area that causes the host operating system to kill your application or until they find the termination character.
Also, if you declare a character array of length 11 and write the string "hello world" into it, you will get serious problems, because the string needs at least 12 characters. The byte that follows the array in memory is overwritten, resulting in unpredictable side effects.
Also, since you are working with C++, you might want to look into std::string. This class provides better handling of strings.
I think what you need to know is that char array indices start at 0 and run to length - 1, and the terminator ('\0') sits at the index equal to the string length.
In your case:
str1[0] == 'h';
str1[10] == 'd';
str1[11] == '\0';
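This is why str2[strlen(str1)] = '\0' targets the right index, although str2 must be allocated with strlen(str1) + 1 bytes for that write to be in bounds.
The problem with the output after the strncpy is that it copies 11 elements (0..10), so you need to add the terminator manually (str2[11] = '\0').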