Hi I have a few typedefs:
typedef unsigned char Byte;
typedef std::vector<Byte> ByteVector;
typedef std::wstring String;
I need to convert String into ByteVector, I have tried this:
String str = L"123";
ByteVector vect(str.begin(), str.end());
As a result vectror contains 3 elements: 1, 2, 3. However it is wstring so every charcter in this string is wide so my expected result would be: 0, 1, 0, 2, 0, 3.
Is there any standart way to do that or I need to write some custom function.
Byte const* p = reinterpret_cast<Byte const*>(&str[0]);
std::size_t size = str.size() * sizeof(str.front());
ByteVector vect(p, p+size);
What is your actual goal? If you just want to get the bytes representing the wchar_t objects, a fairly trivial conversion would do the trick although I wouldn't use just a cast to to unsigned char const* but rather an explicit conversion.
On the other hand, if you actually want to convert the std::wstring into a sequence encoded using e.g. UTF8 or UTF16 as is usually the case when dealing with characters, the conversion used for the encoding becomes significantly more complex. Probably the easiest approach to convert to an encoding is to use C's wcstombs():
std::vector<char> target(source.size() * 4);
size_t n = wcstombs(&target[0], &source[0], target.size());
The above fragment assumes that source isn't empty and that the last wchar_t in source is wchar_t(). The conversion uses C's global locale and assumes to convert whatever character encoding is set up there. There is also a version wcstombs_l() where you can specify the locale.
C++ has similar functionality but it is a bit harder to use in the std::codecvt<...> facet. I can provide an example if necessary.
Related
How do I convert a single char/wchar to a single-character string/wstring in d? I can't find anything online that doesn't talk about char* or wchar*.
As strings are just immutable(char)[], you can construct them like any other array with chars:
char a = 'a';
string s = [a];
There's a few different options. One is to get a pointer by just taking the address of it. You generally shouldn't use this but you should be aware it is possible.
char a = 'a';
char[] b = (&a)[0 .. 1]; // &a gets a pointer, [0..1] slices the single element
string c = b.idup; // copy it into a new string
If you used a wchar you could get a wstring out of it this way. Then std.conv.to can convert between string and wstring.
Speaking of std.conv.to, that's the next option and is actually the easiest:
import std.conv;
char a = 'a'; // or wchar
string b = to!string(a); // or to!wstring
In the real world I'd probably suggest you use this for maximum convenience and simplicity, but you lose a bit of efficiency in some cases.
Thus, the third option I'll present is std.utf.encode.
import std.utf;
char[4] buffer;
auto len = encode(buffer, a); // put the char in the buffer
writeln(buffer[0 .. len]); // slice the buffer. idup it if you want string specifically
This works for any input: char, wchar, or dchar, and will encode multi-byte code points into the string as well. To get a wstring, use wchar[2] for the buffer isntead. This is a good balance of correctness and efficiency, just at the trade of being a little less convenient.
Before(using ASCII) i was using std::string as buffer like this:
std::string test = "";
int value = 6;
test.append("some string");
test.append((char*)value, 4);
test.append("some string");
with expected value in test:
"some srtring\x6\x0\x0\x0somestring"
Now i am tring to use Unicode and i wanna keep the same "code" but trubles happens:
std::wstring test = "";
int value = 6;
test.append("some string");
test.append((wchar_t*)value, 4); (buffer overflow cause reading 8 bytes)
test.append("some string");
How can i append bytes like in std::string?
Doing:
std::wstring test = "";
int value = 6;
test.append("some string");
test.append((wchar_t*)value, 2);
test.append("some string");
Solve partially the problem cause after i can't append bools.
EDIT:
i can even use wstringstream if a binary copy is applied.(normally not)
You're confusing unicode and character encodings. An std::string can represent unicode code points just fine, using the UTF-8 encoding.
Windows uses the UTF-16LE (or UTF-16 with a BOM, I believe) encoding to represent unicode glyphs. Most others use UTF-8.
An std::string which is encoded in UTF-8 and which uses only ASCII characters can actually be interpreted as an ASCII string. This is the beauty of UTF-8. It's a natural extension.
Anyway,
i need a "binary" dynamic buffer, where i can add the real size of types(bool 1, int 4 etc)
An std::vector<uint8_t> is probably more suitable for this task. It communicates that it is not something human-readable, per se. If you need to embed strings into this buffer, make sure that sizeof(char) == sizeof(uint8_t) on the platform, and then just write the data as-is to this buffer.
If you're saving this buffer on one machine and try to read it on another machine, you have to take care of endianness too.
You make a function that reads the stuff you want to put:
void putBytes(std::wstring& s, char* c, int numBytes)
{
while (numBytes-- > 0)
s += (wchar_t)*c++;
}
Then you can call it:
int value = 65;
putBytes(s, reinterpret_cast<char*>(&value), sizeof(value));
I think a IStream is a proper way to do this...i'll make an interface to handle different types. I was abusing std::string for an easy "dynamic binary array", with std::wstring this is not possible,for many reasons but most silly one is that require at least 2 bytes, so no room for a bool
How can/should I cast from a unsigned char array to a widechar array wchar_t or std::wstring? And how can I convert it back to a unsigned char array?
Or can OpenSSL produce a widechar hash from SHA256_Update?
Try the following:
#include <cstdlib>
using namespace std;
unsigned char* temp; // pointer to initial data
// memory allocation and filling
// calculation of string length
wchar_t* wData = new wchar_t[len+1];
mbstowcs(&wData[0], &temp1[0], len);
Сoncerning inverse casting look the example here or just use mbstowcs once again but with changing places of two first arguments.
Also WideCharToMultiByte function can be useful for Windows development, and setting locale should be considered as well (see some examples).
UPDATE:
To calculate length of string pointed by unsigned char* temp the following approach can be used:
const char* ccp = reinterpret_cast<const char*>(temp);
size_t len = mbstowcs(nullptr, &ccp[0], 0);
std::setlocale(LC_ALL, "en_US.utf8");
const char* mbstr = "hello";
std::mbstate_t state = std::mbstate_t();
// calc length
int len = 1 + std::mbsrtowcs(nullptr, &mbstr, 0, &state);
std::vector<wchar_t> wstr(len);
std::mbsrtowcs(&wstr[0], &mbstr, wstr.size(), &state);
How can/should I cast from a unsigned char array to a widechar array wchar_t or std::wstring? And how can I convert it back to a unsigned char array?
They are completely different, so you should not be doing it under most circumstances. If you provide a specific question with real code, than we can probably tell you more.
Or can OpenSSL produce a widechar hash from SHA256_Update?
No, OpenSSL cannot do this. It produces hashes which are binary strings cmposed of bytes, not chars. You are responsible for for presentation details, like narrow/wide character sets or base64 encoding.
I need to convert utf16 text to utf8. The actual conversion code is simple:
std::wstring in(...);
std::string out = boost::locale::conv::utf_to_utf<char, wchar_t>(in);
However the issue is that the UTF16 is read from a file and it may or may not contain BOM. My code needs to be portable (minimum is windows/osx/linux). I'm really struggling to figure out how to create a wstring from the byte sequence.
EDIT: this is not a duplicate of the linked question, as in that question the OP needs to convert a wide string into an array of bytes - and I need to convert the other way around.
You should not use wide types at all in your case.
Assuming you can get a char * from your vector<char>, you can stick to bytes by using the following code:
char * utf16_buffer = &my_vector_of_chars[0];
char * buffer_end = &my_vector_of_chars[vector.size()];
std::string utf8_str = boost::locale::conv::between(utf16_buffer, buffer_end, "UTF-8", "UTF-16");
between operates on 8-bit characters and allows you to avoid conversion to 16-bit characters altogether.
It is necessary to use the between overload that uses the pointer to the buffer's end, because by default, between will stop at the first '\0' character in the string, which will be almost immediately because the input is UTF-16.
here's what I am trying to do:
typedef uint16_t uchar16_t;
uchar16_t buf[32];
// buf will contain timezone information like GMT-6, Eastern Daylight Time, etc
char * str = "Test";
for (int i = 0; i <= strlen(str); i++)
buf[i] = str[i];
I guess that's not correct since uchar16_t would contain 2 bytes and str contains 1 byte.
What is it that I am supposed to do ?
Strlen? buf[32]? Trying to destroy the universe?
You want to use a wstringstream.
std::wstringstream lols;
lols << "Test";
std::wstring cakes;
lols >> cakes;
Edit#Comment:
You shouldn't use strlen because any decent string system allows embedded zeros, and strlen is seriously slow. In addition, you didn't resize your buffer as needed, so if you had a string of size > 31 you would get a buffer overflow. In addition, you would have to (if you did dynamically size your buffer) manually free it afterwards. Both of these things are serious failings of the C string system. My example code makes your standard library writer do all the work and avoid all these problems for you.
That's actually OK if your string will always be ASCII. To do it correctly, the portable function is mbstowcs which assumes you're converting from the default locale or if you're on Windows then there's API functions that let you specify the source code page explicitly.
Your code will work, as long as str is ASCII; calling strlen() in the loop condition is probably a bad idea, though. It might be easier to just use swprintf() if it's available on your system:
uchar16_t buf[32];
char *str = "Test";
swprintf(buf, sizeof buf, "%s", str);
Have a look here.
Also, is there a good reason you are defining your own type?
If you have a (narrow) char string, you cannot convert it to
a wchar_t string by setting your locale to "C" and then passing
the string through mbstowcs(). That's because the "C" locale specifies
a -particular- character encoding, and that encoding might not match
the encoding of the execution character set, so mbstowcs() might
map the characters to something unexpected, or could even fail
(if the execution character set happened to use encodings that
were incompatible with the encoding structure for the C locale
character set.)
Thus, in order to convert a char
string into a wider string, you have
to copy the chars one by one into an
array of wchar_t . If you need to work
with Unicode or utf-16 or whatever
after that, then wcstombs() is what
you should look at.