Null bytes in char* in QByteArray with QDataStream - c++

I'm discovered that char* in QByteArray have null bytes. Code:
QByteArray arr;
QDataStream stream(&arr, QIODevice::WriteOnly);
stream << "hello";
Look at debugger variable view:
I don't understand why I have three empty bytes at the beginning. I know that [3] byte is string length. Can I remove last byte? I know it's null-terminated string, but for my application I must have raw bytes (with one byte at beggining for store length).
More weird for me is when I use QString:
QString str = "hello";
[rest of code same as above]
stream << str;
It's don't have null at end, so I think maybe null bytes before each char informs that next byte is char?
Just two questions:
Why so much null bytes?
How I can remove it, including last null byte?

I don't understand why I have three empty bytes at the beginning.
It's a fixed-size, uint32_t (4-byte) header. It's four bytes so that it can specify data lengths as long as (2^32-1) bytes. If it was only a single byte, then it would only be able to describe strings up to 255 bytes long, because that's the largest integer value that can fit into a single byte.
Can I remove last byte? I know it's null-terminated string, but for my
application I must have raw bytes (with one byte at beggining for
store length).
Sure, as long as the code that will later parse the data array is not depending on the presence of a trailing NUL byte to work correctly.
More weird for me is when I use QString [...] it's don't have null at end, so I think maybe null bytes before each char informs that next byte is char?
Per the Qt serialization documentation page, a QString is serialized as:
- If the string is null: 0xFFFFFFFF (quint32)
- Otherwise: The string length in bytes (quint32) followed by the data in UTF-16.
If you don't like that format, instead of serializing the QString directly, you could do something like
stream << str.toUtf8();
instead, and that way the data in your QByteArray would be in a simpler format (UTF-8).
Why so much null bytes?
They are used in fixed-size header fields when the length-values being encoded are small; or to indicate the end of NUL-terminated C strings.
How I can remove it, including last null byte?
You could add the string in your preferred format (no NUL terminator but with a single length header-byte) like this:
const char * hello = "hello";
char slen = strlen(hello);
stream.writeRawData(&slen, 1);
stream.writeRawData(hello, slen);
... but if you have the choice, I highly recommend just keeping the NUL-terminator bytes at the end of the strings, for these reasons:
A single preceding length-byte will limit your strings to 255 bytes long (or less), which is an unnecessary restriction that will likely haunt you in the future.
Avoiding the NUL-terminator byte doesn't actually save any space, because you've added a string-length byte to compensate.
If the NUL-terminator byte is there, you can simply pass a pointer to the first byte of the string directly to any code expects a C-style string, and it will be able to use the string immediately (without any data-conversion steps). If you rely on a different convention instead, you'll end up having to make a copy of the entire string before you can pass it to that code, just so that you can append a NUL byte to the end of the string so that that C-string-expecting code can use it. That will be CPU-inefficient and error-prone.

Related

How to convert QByteArray to a byte string?

I have a QByteArray object with 256 bytes inside of it. However, when I try to convert it to a byte string (std::string), it comes up with a length of 0 to 256. It is very inconsistent with how long the string is, but there are always 256 bytes in the array. This data is encrypted and as such I expect a 256-character garbage output, but instead I am getting a random number of bytes.
Here is the code I used to try to convert it:
// Fills array with 256 bytes (I have tried read(256) and got the same random output)
QByteArray byteRecv = socket->read(2048);
// Gives out random garbage (not the 256-character garbage that I need)
string recv = byteRecv.constData();
The socket object is a QTcpSocket* if it's necessary to know.
Is there any way I can get an exact representation of what's in the array? I have tried converting it to a QString and using the QByteArray::toStdString() method, but neither of those worked to solve the problem.
QByteArray::constData() member function returns a raw pointer const char*. The constructor of std::string from a raw pointer
std::string(const char* s);
constructs the string with the contents initialized with a copy of the null-terminated character string pointed to by s. The length of the string is determined by the first null character. If s does not point to such a string, the behaviour is undefined.
Your buffer is not a null-terminated string and can contain null characters in the middle. So you should use another constructor
std::string(const char* s, std::size_type count);
that constructs the string with the first count characters of character string pointed to by s.
That is:
std::string recv(byteRecv.constData(), 256);
For a collection of raw bytes, std::vector might be a better choice. You can construct it from two pointers:
std::vector<char> recv(byteRecv.constData(), byteRecv.constData() + 256);

sprintf() is adding an extra variable

Why is this happening?:
char buf[256];
char date[8];
sprintf(date, "%d%02d%02d", Time.year(), Time.month(), Time.day());
snprintf(buf, sizeof(buf), "{\"team\":\"%s\"}", team.c_str());
Serial.println(date);
output:
20180202{"team":"IND"}
it should only be: 20180202
I don't know why {"team":"IND"} is getting added to the end of it.
Very likely you declared two arrays and they are lined up in a way that allowed for the buf to overwrite the null terminator of date and thus it's "concatenating" the two.
I can't write code to reproduce this because it's undefined behavior and thus not reliable. But I can tell you how you can avoid it,
snprintf(date, sizeof(date), "%d%02d%02d", Time.year(), Time.month(), Time.day());
snprintf(buf, sizeof(buf), "{\"team\":\"%s\"}", team.c_str());
Having said that, why are you using snprintf() when this appears to be c++? And so there are more suitable solutions for this kind of problem.
This would print an incorrect value, but would not cause any unexpected behavior.
Strings in c are simply arrays with a special arrangement. If the string has n printable characters it should be stored in an array of size n + 1, so that you can add what is called a null terminator. It's a special value that indicates the end of the string.
Your second snprintf() is overwriting such null terminator of the date array and thus appearing to concatenate both strings.
You have reserved space to store exactly 8 chars:
char date[8];
To store the date properly 20180202 you need
char date[9];
because sprintf() puts the extra '\0' character to the buffer you pass for proper c-style string termination.
I'd suspect you declared your buffers like
char buffer[???];
char date[8];
since these are most likely stored on your processors stack, you need to read that backwards, thus the output placed at buffer overwrites that terminating '\0', and appears immediately after date.

Why does std::string("\x00") report length of 0?

I have a function which needs to encode strings, which needs to be able to accept 0x00 as a valid 'byte'. My program needs to check the length of the string, however if I pass in "\x00" to std::string the length() method returns 0.
How can I get the actual length even if the string is a single null character?
std::string is perfectly capable of storing nulls. However, you have to be wary, as const char* is not, and you very briefly construct a const char*, from which you create the std::string.
std::string a("\x00");
This creates a constant C string containing only the null character, followed by a null terminator. But C strings don't know how long they are; so the string thinks it runs until the first null terminator, which is the first character. Hence, a zero-length string is created.
std::string b("");
b.push_back('\0');
std::string is null-clean. Characters (\0) can be the zero byte freely as well. So, here, there is nothing stopping us from correctly reading the data structure. The length of b will be 1.
In general, you need to avoid constructing C strings containing null characters. If you read the input from a file directly into std::string or make sure to push the characters one at a time, you can get the result you want. If you really need a constant string with null characters, consider using some other sentinel character instead of \0 and then (if you really need it) replace those characters with '\0' after loading into std::string.
You're passing in an empty string. Use std::string(1, '\0') instead.
Or std::string{ '\0' } (thanks, #zett42)
With C++14, you can use a string literal operator to store strings with null bytes:
using namespace std::string_literals;
std::string a = "\0"s;
std::string aa = "\0\0"s; // two null bytes are supported too

Why wstring can accept WCHAR[] while string doesn't accept UCHAR[]

I am trying to print the returned value of NtQueryValueKey which is UCHAR Data[1]; i have tried printf, cout, and string(Data, DataLengh), with the first two printing only 1 character and the last one throws an exception... Basically if i changed the Data Type to WCHAR Data[1] and used wstring(Data) it accepts it normally without any complain... also wprintf prints the value normally.
Edit: I meant NtQueryValueKey using the KEY_VALUE_PARTIAL_INFORMATION, I am using VS 2015 btw...
You must have mixed something up. You did not specify what value from the KEY_NAME_INFORMATION enumeration you are using for the second parameter to specify the data type, but a quick look at MSDN shows that all of the structures contain WCHAR Name[1]; or something similar as the last member (which I guess is the one you are interested in). Can you elaborate and provide the link or other means of documentation that states you actually need to use UCHAR ?
WCHAR is an alias for wchar_t. std::wstring operates with wchar_t elements. A WCHAR[] can decay to a wchar_t*, and thus can be assigned directly to a std::wstring.
UCHAR is an alias for unsigned char. std::string operates with char elements instead. A UCHAR[]/UCHAR* cannot be assigned directly to a std::string without a type-cast to char*, as char and unsigned char are distinct data types.
unsigned char is commonly used to represent 8bit bytes (it is the same data type used for BYTE).
NtQueryKey() returns strings as UTF-16LE encoded bytes using WCHAR[] character arrays, not UCHAR[] byte arrays. So your code is declaring things wrong if you are using UCHAR[] to begin with. But even so, you can use UCHAR if you pay attention to the encoding and byte length, and use appropriate type-casts.
Any associated Length value reported by NtQueryKey() is expressed in bytes, not characters. sizeof(UCHAR) is 1 and sizeof(WCHAR) is 2. So every 2 UCHARs represents 1 WCHAR. And the strings are not null-terminated, so you have to take the Length into account when printing or converting.
In Latin-based languages, most commonly used Unicode characters will be <= U+00FF, and thus every other UCHAR in UTF-16LE will usually be 0. That is interpreted as a null terminator when UTF-16 is printed with printf() or std::cout. You need to use wprintf() or std::wcout instead.
Converting Data to a std::string is a valid operation and should not be raising an exception:
std::string((char*)Data, DataLength)
Provided that:
Data is a valid pointer.
DataLength is an accurate byte count.
The only way this could raise an exception is if either:
Data is not pointing at valid memory.
the value of DataLength is more than the actual number of bytes allocated for Data.
available memory is too low to allocate std::string's internal buffer.
memory is corrupted.
Assigning Data by itself to a std::wstring without taking DataLength into account is not a valid operation because the strings are not null-terminated. You must specify the length:
std::wstring(Data, DataLength / sizeof(WCHAR))
If Data is UCHAR then use a type-cast:
std::wstring((WCHAR*)Data, DataLength / sizeof(WCHAR))
When printing Data directly with wprintf(), you must pass DataLength as an input parameter:
wprintf(L"%.*s", DataLength / sizeof(WCHAR), Data);
When printing Data directly with std::wcout, you should use write() instead of operator<< so you can pass DataLength as an input parameter:
std::wcout.write(Data, DataLength / sizeof(WCHAR));
If Data is UCHAR then use a type-cast:
std::wcout.write((WCHAR*)Data, DataLength / sizeof(WCHAR));

Struggling to convert vector<char> to wstring

I need to convert utf16 text to utf8. The actual conversion code is simple:
std::wstring in(...);
std::string out = boost::locale::conv::utf_to_utf<char, wchar_t>(in);
However the issue is that the UTF16 is read from a file and it may or may not contain BOM. My code needs to be portable (minimum is windows/osx/linux). I'm really struggling to figure out how to create a wstring from the byte sequence.
EDIT: this is not a duplicate of the linked question, as in that question the OP needs to convert a wide string into an array of bytes - and I need to convert the other way around.
You should not use wide types at all in your case.
Assuming you can get a char * from your vector<char>, you can stick to bytes by using the following code:
char * utf16_buffer = &my_vector_of_chars[0];
char * buffer_end = &my_vector_of_chars[vector.size()];
std::string utf8_str = boost::locale::conv::between(utf16_buffer, buffer_end, "UTF-8", "UTF-16");
between operates on 8-bit characters and allows you to avoid conversion to 16-bit characters altogether.
It is necessary to use the between overload that uses the pointer to the buffer's end, because by default, between will stop at the first '\0' character in the string, which will be almost immediately because the input is UTF-16.