I need to convert UTF-16 text to UTF-8. The actual conversion code is simple:
std::wstring in(...);
std::string out = boost::locale::conv::utf_to_utf<char, wchar_t>(in);
However, the issue is that the UTF-16 is read from a file and it may or may not contain a BOM. My code needs to be portable (at minimum Windows/macOS/Linux). I'm really struggling to figure out how to create a wstring from the byte sequence.
EDIT: this is not a duplicate of the linked question, as in that question the OP needs to convert a wide string into an array of bytes - and I need to convert the other way around.
You should not use wide types at all in your case.
Assuming you can get a char * from your vector<char>, you can stick to bytes by using the following code:
char * utf16_buffer = my_vector_of_chars.data();
char * buffer_end = my_vector_of_chars.data() + my_vector_of_chars.size();
std::string utf8_str = boost::locale::conv::between(utf16_buffer, buffer_end, "UTF-8", "UTF-16");
between operates on 8-bit characters and allows you to avoid conversion to 16-bit characters altogether.
It is necessary to use the between overload that takes a pointer to the buffer's end, because by default between will stop at the first '\0' character in the string, which it will hit almost immediately since the input is UTF-16.
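One way to deal with a possible BOM explicitly is to inspect the first two bytes yourself, skip them, and pass the concrete byte order to between. A minimal sketch, assuming the data is in a std::vector<char> and that your Boost.Locale backend accepts the "UTF-16LE"/"UTF-16BE" encoding names (the function name is made up for illustration):

#include <boost/locale/encoding.hpp>
#include <string>
#include <vector>

// Detect an optional UTF-16 BOM, skip it, and convert the rest to UTF-8.
std::string utf16_bytes_to_utf8(const std::vector<char>& bytes)
{
    const char* begin = bytes.data();
    const char* end   = bytes.data() + bytes.size();
    const char* enc   = "UTF-16";   // no BOM: let the backend pick a default
    if (bytes.size() >= 2) {
        unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFF && b1 == 0xFE)      { enc = "UTF-16LE"; begin += 2; }
        else if (b0 == 0xFE && b1 == 0xFF) { enc = "UTF-16BE"; begin += 2; }
    }
    return boost::locale::conv::between(begin, end, "UTF-8", enc);
}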
Before (using ASCII), I was using std::string as a buffer, like this:
std::string test = "";
int value = 6;
test.append("some string");
test.append((char*)&value, 4);
test.append("some string");
with the expected value in test:
"some string\x6\x0\x0\x0some string"
Now I am trying to use Unicode and I want to keep the same "code", but trouble happens:
std::wstring test = "";
int value = 6;
test.append("some string");
test.append((wchar_t*)value, 4); (buffer overflow cause reading 8 bytes)
test.append("some string");
How can I append bytes like I did with std::string?
Doing:
std::wstring test = "";
int value = 6;
test.append("some string");
test.append((wchar_t*)value, 2);
test.append("some string");
partially solves the problem, but afterwards I can't append bools.
EDIT:
I can even use a wstringstream if a binary copy is applied (normally it is not).
You're confusing Unicode and character encodings. An std::string can represent Unicode code points just fine, using the UTF-8 encoding.
Windows uses the UTF-16LE encoding (or UTF-16 with a BOM, I believe) to represent Unicode glyphs. Most others use UTF-8.
An std::string which is encoded in UTF-8 and which uses only ASCII characters can actually be interpreted as an ASCII string. This is the beauty of UTF-8. It's a natural extension.
Anyway,
i need a "binary" dynamic buffer, where i can add the real size of types(bool 1, int 4 etc)
An std::vector<uint8_t> is probably more suitable for this task. It communicates that it is not something human-readable, per se. If you need to embed strings into this buffer, make sure that sizeof(char) == sizeof(uint8_t) on the platform, and then just write the data as-is to this buffer.
If you're saving this buffer on one machine and try to read it on another machine, you have to take care of endianness too.
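A rough sketch of that idea (the helper names are made up, and reading the buffer back assumes the same endianness as the writer):

#include <cstdint>
#include <string>
#include <vector>

using Buffer = std::vector<uint8_t>;

// Append the raw bytes of any trivially copyable value (bool, int, float, ...),
// using exactly sizeof(T) bytes.
template <typename T>
void appendValue(Buffer& buf, const T& value)
{
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&value);
    buf.insert(buf.end(), p, p + sizeof(T));
}

// Append the characters of a string without a terminating '\0'.
void appendString(Buffer& buf, const std::string& s)
{
    buf.insert(buf.end(), s.begin(), s.end());
}

int main()
{
    Buffer buf;
    appendString(buf, "some string");
    appendValue(buf, 6);     // 4 bytes on typical platforms
    appendValue(buf, true);  // exactly 1 byte
    appendString(buf, "some string");
}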
You can make a function that appends the bytes you want to put in:
void putBytes(std::wstring& s, char* c, int numBytes)
{
    // Append each byte as its own wchar_t element.
    while (numBytes-- > 0)
        s += (wchar_t)*c++;
}
Then you can call it:
int value = 65;
putBytes(s, reinterpret_cast<char*>(&value), sizeof(value));
I think an IStream is the proper way to do this... I'll make an interface to handle the different types. I was abusing std::string as an easy "dynamic binary array"; with std::wstring this is not possible, for many reasons, but the silliest one is that each element requires at least 2 bytes, so there is no room for a bool.
I am trying to print the returned value of NtQueryValueKey, which is UCHAR Data[1]; I have tried printf, cout, and string(Data, DataLength), with the first two printing only 1 character and the last one throwing an exception... Basically, if I change the data type to WCHAR Data[1] and use wstring(Data), it is accepted normally without any complaint... wprintf also prints the value normally.
Edit: I meant NtQueryValueKey using KEY_VALUE_PARTIAL_INFORMATION. I am using VS 2015, by the way...
You must have mixed something up. You did not specify what value from the KEY_NAME_INFORMATION enumeration you are using for the second parameter to specify the data type, but a quick look at MSDN shows that all of the structures contain WCHAR Name[1]; or something similar as the last member (which I guess is the one you are interested in). Can you elaborate and provide the link or other documentation that states you actually need to use UCHAR?
WCHAR is an alias for wchar_t. std::wstring operates with wchar_t elements. A WCHAR[] can decay to a wchar_t*, and thus can be assigned directly to a std::wstring.
UCHAR is an alias for unsigned char. std::string operates with char elements instead. A UCHAR[]/UCHAR* cannot be assigned directly to a std::string without a type-cast to char*, as char and unsigned char are distinct data types.
unsigned char is commonly used to represent 8-bit bytes (it is the same data type used for BYTE).
NtQueryKey() returns strings as UTF-16LE encoded bytes using WCHAR[] character arrays, not UCHAR[] byte arrays. So your code is declaring things wrong if you are using UCHAR[] to begin with. But even so, you can use UCHAR if you pay attention to the encoding and byte length, and use appropriate type-casts.
Any associated Length value reported by NtQueryKey() is expressed in bytes, not characters. sizeof(UCHAR) is 1 and sizeof(WCHAR) is 2. So every 2 UCHARs represents 1 WCHAR. And the strings are not null-terminated, so you have to take the Length into account when printing or converting.
In Latin-based languages, most commonly used Unicode characters will be <= U+00FF, and thus every other UCHAR in UTF-16LE will usually be 0. That is interpreted as a null terminator when UTF-16 is printed with printf() or std::cout. You need to use wprintf() or std::wcout instead.
Converting Data to a std::string is a valid operation and should not be raising an exception:
std::string((char*)Data, DataLength)
Provided that:
Data is a valid pointer.
DataLength is an accurate byte count.
The only way this could raise an exception is if one of the following is true:
Data is not pointing at valid memory.
the value of DataLength is more than the actual number of bytes allocated for Data.
available memory is too low to allocate std::string's internal buffer.
memory is corrupted.
Assigning Data by itself to a std::wstring without taking DataLength into account is not a valid operation because the strings are not null-terminated. You must specify the length:
std::wstring(Data, DataLength / sizeof(WCHAR))
If Data is UCHAR then use a type-cast:
std::wstring((WCHAR*)Data, DataLength / sizeof(WCHAR))
When printing Data directly with wprintf(), you must pass DataLength as an input parameter:
wprintf(L"%.*s", DataLength / sizeof(WCHAR), Data);
When printing Data directly with std::wcout, you should use write() instead of operator<< so you can pass DataLength as an input parameter:
std::wcout.write(Data, DataLength / sizeof(WCHAR));
If Data is UCHAR then use a type-cast:
std::wcout.write((WCHAR*)Data, DataLength / sizeof(WCHAR));
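Putting those pieces together, a minimal sketch might look like this. It assumes kvpi already points at a KEY_VALUE_PARTIAL_INFORMATION buffer that a successful NtQueryValueKey call filled in with a string-typed value; the structure is spelled out here because user-mode SDK headers may not declare it:

#include <windows.h>
#include <iostream>
#include <string>

// Layout as documented for NtQueryValueKey's KeyValuePartialInformation class.
typedef struct _KEY_VALUE_PARTIAL_INFORMATION {
    ULONG TitleIndex;
    ULONG Type;
    ULONG DataLength;
    UCHAR Data[1];
} KEY_VALUE_PARTIAL_INFORMATION;

void printStringValue(const KEY_VALUE_PARTIAL_INFORMATION* kvpi)
{
    // DataLength is in bytes and the data is not guaranteed to be
    // null-terminated, so compute the WCHAR count explicitly.
    std::wstring value(reinterpret_cast<const WCHAR*>(kvpi->Data),
                       kvpi->DataLength / sizeof(WCHAR));

    // REG_SZ data often carries a trailing L'\0'; strip it if present.
    while (!value.empty() && value.back() == L'\0')
        value.pop_back();

    std::wcout << value << std::endl;
}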
I've discovered that the char* data in my QByteArray has null bytes. Code:
QByteArray arr;
QDataStream stream(&arr, QIODevice::WriteOnly);
stream << "hello";
Looking at the debugger's variable view, I don't understand why I have three empty bytes at the beginning. I know that byte [3] is the string length. Can I remove the last byte? I know it's a null-terminated string, but for my application I must have raw bytes (with one byte at the beginning to store the length).
Even weirder for me is when I use QString:
QString str = "hello";
[rest of code same as above]
stream << str;
It doesn't have a null at the end, so I think maybe the null byte before each char indicates that the next byte is a char?
Just two questions:
Why so many null bytes?
How can I remove them, including the last null byte?
I don't understand why I have three empty bytes at the beginning.
It's a fixed-size, uint32_t (4-byte) header. It's four bytes so that it can specify data lengths as long as (2^32-1) bytes. If it was only a single byte, then it would only be able to describe strings up to 255 bytes long, because that's the largest integer value that can fit into a single byte.
Can I remove the last byte? I know it's a null-terminated string, but for my application I must have raw bytes (with one byte at the beginning to store the length).
Sure, as long as the code that will later parse the data array does not depend on the presence of a trailing NUL byte to work correctly.
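For example, a quick (hypothetical) way to post-process the serialized array into the format you described, given the layout above (4-byte length header, then the string, then a trailing '\0'):

QByteArray arr;
QDataStream stream(&arr, QIODevice::WriteOnly);
stream << "hello";

arr.remove(0, 4);                            // drop the 4-byte length header
arr.chop(1);                                 // drop the trailing '\0'
arr.prepend(static_cast<char>(arr.size()));  // one-byte length prefix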
Even weirder for me is when I use QString [...] it doesn't have a null at the end, so I think maybe the null byte before each char indicates that the next byte is a char?
Per the Qt serialization documentation page, a QString is serialized as:
- If the string is null: 0xFFFFFFFF (quint32)
- Otherwise: The string length in bytes (quint32) followed by the data in UTF-16.
If you don't like that format, instead of serializing the QString directly, you could do something like
stream << str.toUtf8();
instead, and that way the data in your QByteArray would be in a simpler format (UTF-8).
Why so many null bytes?
They are used in fixed-size header fields when the length-values being encoded are small; or to indicate the end of NUL-terminated C strings.
How can I remove them, including the last null byte?
You could add the string in your preferred format (no NUL terminator but with a single length header-byte) like this:
const char * hello = "hello";
char slen = static_cast<char>(strlen(hello));  // strlen() returns size_t; narrow it explicitly
stream.writeRawData(&slen, 1);
stream.writeRawData(hello, slen);
... but if you have the choice, I highly recommend just keeping the NUL-terminator bytes at the end of the strings, for these reasons:
A single preceding length-byte will limit your strings to 255 bytes long (or less), which is an unnecessary restriction that will likely haunt you in the future.
Avoiding the NUL-terminator byte doesn't actually save any space, because you've added a string-length byte to compensate.
If the NUL-terminator byte is there, you can simply pass a pointer to the first byte of the string directly to any code that expects a C-style string, and it will be able to use the string immediately (without any data-conversion steps). If you rely on a different convention instead, you'll end up having to make a copy of the entire string before you can pass it to that code, just so that you can append a NUL byte to the end of the string so that the C-string-expecting code can use it. That will be CPU-inefficient and error-prone.
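If you do go with the single length byte written by the earlier snippet, reading it back could look roughly like this (assuming the read stream is positioned at the length byte):

QDataStream in(&arr, QIODevice::ReadOnly);
char slen = 0;
in.readRawData(&slen, 1);           // the one-byte length header
QByteArray text(slen, '\0');        // no NUL terminator follows in the stream
in.readRawData(text.data(), slen);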
Hi, I have a few typedefs:
typedef unsigned char Byte;
typedef std::vector<Byte> ByteVector;
typedef std::wstring String;
I need to convert String into ByteVector, I have tried this:
String str = L"123";
ByteVector vect(str.begin(), str.end());
As a result the vector contains 3 elements: 1, 2, 3. However, it is a wstring, so every character in this string is wide, and my expected result would be: 0, 1, 0, 2, 0, 3.
Is there any standard way to do that, or do I need to write a custom function?
Byte const* p = reinterpret_cast<Byte const*>(&str[0]);
std::size_t size = str.size() * sizeof(str.front());
ByteVector vect(p, p+size);
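In context, a compilable sketch of that cast-based copy (note that the resulting byte order and element size depend on the platform's endianness and wchar_t width, so on a little-endian system with 2-byte wchar_t you would get 1, 0, 2, 0, 3, 0 rather than 0, 1, 0, 2, 0, 3):

#include <string>
#include <vector>

typedef unsigned char Byte;
typedef std::vector<Byte> ByteVector;
typedef std::wstring String;

int main()
{
    String str = L"123";
    Byte const* p = reinterpret_cast<Byte const*>(str.data());
    ByteVector vect(p, p + str.size() * sizeof(wchar_t));
}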
What is your actual goal? If you just want to get the bytes representing the wchar_t objects, a fairly trivial conversion would do the trick, although I wouldn't use just a cast to unsigned char const* but rather an explicit conversion.
On the other hand, if you actually want to convert the std::wstring into a sequence encoded using e.g. UTF8 or UTF16 as is usually the case when dealing with characters, the conversion used for the encoding becomes significantly more complex. Probably the easiest approach to convert to an encoding is to use C's wcstombs():
std::vector<char> target(source.size() * 4);
size_t n = wcstombs(&target[0], &source[0], target.size());
The above fragment assumes that source isn't empty and that the last wchar_t in source is wchar_t(). The conversion uses C's global locale and converts to whatever character encoding is set up there. There is also a version wcstombs_l() where you can specify the locale.
C++ has similar functionality in the std::codecvt<...> facet, but it is a bit harder to use. I can provide an example if necessary.
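For reference, a sketch of the standard-library route via std::wstring_convert (added in C++11, deprecated in C++17) rather than using the codecvt facet directly; if your wide strings can contain UTF-16 surrogate pairs (as on Windows), std::codecvt_utf8_utf16<wchar_t> would be the closer match:

#include <codecvt>
#include <locale>
#include <string>

// Convert a wide string to UTF-8, treating each wchar_t as a Unicode code point.
std::string to_utf8(const std::wstring& source)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(source);
}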
Here's what I am trying to do:
typedef uint16_t uchar16_t;
uchar16_t buf[32];
// buf will contain timezone information like GMT-6, Eastern Daylight Time, etc
char * str = "Test";
for (int i = 0; i <= strlen(str); i++)
buf[i] = str[i];
I guess that's not correct, since uchar16_t contains 2 bytes and each element of str contains 1 byte.
What is it that I am supposed to do?
Strlen? buf[32]? Trying to destroy the universe?
You want to use a wstringstream.
std::wstringstream lols;
lols << "Test";
std::wstring cakes;
lols >> cakes;
Edit (re: comment):
You shouldn't use strlen because any decent string system allows embedded zeros, and strlen is seriously slow. In addition, you didn't resize your buffer as needed, so if you had a string of size > 31 you would get a buffer overflow. You would also have to (if you did dynamically size your buffer) manually free it afterwards. Both of these things are serious failings of the C string system. My example code makes your standard library writer do all the work and avoids all these problems for you.
That's actually OK if your string will always be ASCII. To do it correctly, the portable function is mbstowcs, which assumes you're converting from the default locale; or, if you're on Windows, there are API functions that let you specify the source code page explicitly.
Your code will work, as long as str is ASCII; calling strlen() in the loop condition is probably a bad idea, though. It might be easier to just use swprintf() if it's available on your system:
wchar_t buf[32]; // swprintf() operates on wchar_t, not a user-defined 16-bit type
const char *str = "Test";
// Note: the conversion specifier for a narrow string argument is "%s" per
// the C standard, but Microsoft's wide printf family expects "%hs" here.
swprintf(buf, sizeof buf / sizeof buf[0], L"%s", str);
Also, is there a good reason you are defining your own type?
If you have a (narrow) char string, you cannot convert it to a wchar_t string by setting your locale to "C" and then passing the string through mbstowcs(). That's because the "C" locale specifies a particular character encoding, and that encoding might not match the encoding of the execution character set, so mbstowcs() might map the characters to something unexpected, or could even fail (if the execution character set happened to use encodings that were incompatible with the encoding structure for the C locale character set).
Thus, in order to convert a char string into a wider string, you have to copy the chars one by one into an array of wchar_t. If you need to work with Unicode or UTF-16 or whatever after that, then wcstombs() is what you should look at.
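As a sketch of that byte-by-byte widening (only meaningful for plain 7-bit ASCII input; the function name is made up):

#include <string>

// Widen a narrow string by copying each char into a wchar_t.
// Correct only for ASCII; anything else needs a real encoding conversion.
std::wstring widen_ascii(const std::string& in)
{
    return std::wstring(in.begin(), in.end());
}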