Before (using ASCII) I was using std::string as a buffer, like this:
std::string test = "";
int value = 6;
test.append("some string");
test.append((char*)&value, 4);
test.append("some string");
with expected value in test:
"some srtring\x6\x0\x0\x0somestring"
Now I am trying to use Unicode and I want to keep the same "code", but trouble happens:
std::wstring test = L"";
int value = 6;
test.append(L"some string");
test.append((wchar_t*)&value, 4); // buffer overflow: this reads 8 bytes from a 4-byte int
test.append(L"some string");
How can I append raw bytes the way I did with std::string?
Doing:
std::wstring test = L"";
int value = 6;
test.append(L"some string");
test.append((wchar_t*)&value, 2);
test.append(L"some string");
This partially solves the problem, but afterwards I can't append bools.
EDIT:
I could even use a wstringstream, if it did a binary copy (normally it does not).
You're confusing Unicode and character encodings. An std::string can represent Unicode code points just fine, using the UTF-8 encoding.
Windows uses the UTF-16LE encoding (or UTF-16 with a BOM, I believe) to represent Unicode text. Most other platforms use UTF-8.
An std::string which is encoded in UTF-8 and which uses only ASCII characters can actually be interpreted as an ASCII string. This is the beauty of UTF-8. It's a natural extension.
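To make that concrete, here is a tiny sketch (the byte values are just an illustration): an ASCII-only std::string is already valid UTF-8, while a non-ASCII code point simply takes more than one byte.
#include <iostream>
#include <string>

int main()
{
    std::string ascii = "hello";          // ASCII-only: also valid UTF-8
    std::string utf8  = "h\xC3\xA9llo";   // "héllo": U+00E9 takes two bytes in UTF-8
    std::cout << ascii.size() << ' ' << utf8.size() << '\n'; // prints "5 6"
    return 0;
}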
Anyway,
I need a "binary" dynamic buffer, where I can add the real size of types (bool 1, int 4, etc.)
An std::vector<uint8_t> is probably more suitable for this task. It communicates that it is not something human-readable, per se. If you need to embed strings into this buffer, make sure that sizeof(char) == sizeof(uint8_t) on the platform, and then just write the data as-is to this buffer.
If you're saving this buffer on one machine and try to read it on another machine, you have to take care of endianness too.
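As a rough sketch of that idea (the helper names appendRaw and appendString are made up for illustration, and the byte order is whatever the host uses), you could append the raw bytes of any trivially copyable value, so a bool takes 1 byte and an int takes 4:
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: copy the object's bytes (host endianness) into the buffer.
template <typename T>
void appendRaw(std::vector<std::uint8_t>& buf, const T& value)
{
    const std::uint8_t* p = reinterpret_cast<const std::uint8_t*>(&value);
    buf.insert(buf.end(), p, p + sizeof(T));
}

// Hypothetical helper: append a string's characters as-is.
void appendString(std::vector<std::uint8_t>& buf, const std::string& s)
{
    buf.insert(buf.end(), s.begin(), s.end());
}

int main()
{
    std::vector<std::uint8_t> buf;
    appendString(buf, "some string");
    appendRaw(buf, 6);        // 4 bytes for an int
    appendRaw(buf, true);     // 1 byte for a bool
    appendString(buf, "some string");
    return 0;
}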
You can write a function that appends the raw bytes of whatever you want to store:
void putBytes(std::wstring& s, char* c, int numBytes)
{
    // Widen each byte to a wchar_t and append it to the string
    while (numBytes-- > 0)
        s += (wchar_t)*c++;
}
Then you can call it:
int value = 65;
putBytes(s, reinterpret_cast<char*>(&value), sizeof(value));
I think an IStream is the proper way to do this... I'll make an interface to handle different types. I was abusing std::string as an easy "dynamic binary array"; with std::wstring this is not possible, for many reasons, but the silliest one is that each element requires at least 2 bytes, so there is no room for a bool.
How do I convert a single char/wchar to a single-character string/wstring in D? I can't find anything online that doesn't talk about char* or wchar*.
As strings are just immutable(char)[], you can construct them like any other array of chars:
char a = 'a';
string s = [a];
There are a few different options. One is to get a pointer by just taking the address of the char. You generally shouldn't use this, but you should be aware it is possible.
char a = 'a';
char[] b = (&a)[0 .. 1]; // &a gets a pointer, [0..1] slices the single element
string c = b.idup; // copy it into a new string
If you used a wchar you could get a wstring out of it this way. Then std.conv.to can convert between string and wstring.
Speaking of std.conv.to, that's the next option and is actually the easiest:
import std.conv;
char a = 'a'; // or wchar
string b = to!string(a); // or to!wstring
In the real world I'd probably suggest you use this for maximum convenience and simplicity, but you lose a bit of efficiency in some cases.
Thus, the third option I'll present is std.utf.encode.
import std.utf;
char[4] buffer;
auto len = encode(buffer, a); // put the char in the buffer
writeln(buffer[0 .. len]); // slice the buffer. idup it if you want string specifically
This works for any input: char, wchar, or dchar, and will encode multi-byte code points into the string as well. To get a wstring, use wchar[2] for the buffer instead. This is a good balance of correctness and efficiency, just at the cost of being a little less convenient.
I need to convert UTF-16 text to UTF-8. The actual conversion code is simple:
std::wstring in(...);
std::string out = boost::locale::conv::utf_to_utf<char, wchar_t>(in);
However, the issue is that the UTF-16 is read from a file, and it may or may not contain a BOM. My code needs to be portable (minimum is Windows/OSX/Linux). I'm really struggling to figure out how to create a wstring from the byte sequence.
EDIT: this is not a duplicate of the linked question, as in that question the OP needs to convert a wide string into an array of bytes - and I need to convert the other way around.
You should not use wide types at all in your case.
Assuming you can get a char * from your vector<char>, you can stick to bytes by using the following code:
char * utf16_buffer = &my_vector_of_chars[0];
char * buffer_end = &my_vector_of_chars[0] + my_vector_of_chars.size();
std::string utf8_str = boost::locale::conv::between(utf16_buffer, buffer_end, "UTF-8", "UTF-16");
between operates on 8-bit characters and allows you to avoid conversion to 16-bit characters altogether.
It is necessary to use the between overload that takes a pointer to the buffer's end because, by default, between will stop at the first '\0' character in the string, which will occur almost immediately since the input is UTF-16.
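If you also need to cope with an optional BOM, one possible sketch (assuming Boost.Locale is available and that a missing BOM means little-endian input) is to look at the first two bytes and pass an explicit byte order to between:
#include <boost/locale.hpp>
#include <string>
#include <vector>

std::string utf16_bytes_to_utf8(const std::vector<char>& raw)
{
    const char* begin = raw.data();
    const char* end = raw.data() + raw.size();
    std::string encoding = "UTF-16LE";    // assumption when no BOM is present
    if (raw.size() >= 2) {
        unsigned char b0 = raw[0], b1 = raw[1];
        if (b0 == 0xFF && b1 == 0xFE) { encoding = "UTF-16LE"; begin += 2; } // skip BOM
        else if (b0 == 0xFE && b1 == 0xFF) { encoding = "UTF-16BE"; begin += 2; }
    }
    return boost::locale::conv::between(begin, end, "UTF-8", encoding);
}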
Hi, I have a few typedefs:
typedef unsigned char Byte;
typedef std::vector<Byte> ByteVector;
typedef std::wstring String;
I need to convert String into ByteVector, I have tried this:
String str = L"123";
ByteVector vect(str.begin(), str.end());
As a result the vector contains 3 elements: 1, 2, 3. However, it is a wstring, so every character in this string is wide, and my expected result would be: 0, 1, 0, 2, 0, 3.
Is there any standard way to do that, or do I need to write some custom function?
Byte const* p = reinterpret_cast<Byte const*>(&str[0]);  // view the wchar_t storage as raw bytes
std::size_t size = str.size() * sizeof(str.front());     // total size in bytes
ByteVector vect(p, p + size);
What is your actual goal? If you just want to get the bytes representing the wchar_t objects, a fairly trivial conversion would do the trick, although I wouldn't use just a cast to unsigned char const* but rather an explicit conversion.
On the other hand, if you actually want to convert the std::wstring into a sequence encoded using e.g. UTF8 or UTF16 as is usually the case when dealing with characters, the conversion used for the encoding becomes significantly more complex. Probably the easiest approach to convert to an encoding is to use C's wcstombs():
std::vector<char> target(source.size() * 4);
size_t n = wcstombs(&target[0], &source[0], target.size());
The above fragment assumes that source isn't empty and that the last wchar_t in source is wchar_t(). The conversion uses C's global locale and converts to whatever character encoding is set up there. There is also a version wcstombs_l() where you can specify the locale.
C++ has similar functionality in the std::codecvt<...> facet, but it is a bit harder to use. I can provide an example if necessary.
EDIT I modified the question after realizing it was wrong to begin with.
I'm porting part of a C# application to Linux, where I need to get the bytes of a UTF-16 string:
string myString = "ABC";
byte[] bytes = Encoding.Unicode.GetBytes(myString);
So that the bytes array is now:
"65 00 66 00 67 00" (bytes)
How can I achieve the same in C++ on Linux? I have myString defined as std::string, and it seems that wchar_t on Linux is 4 bytes?
Your question isn't really clear, but I'll try to clear up some confusion.
Introduction
The status of character set handling in C (and inherited by C++) after the 1995 amendment to the C standard:
the character set used is given by the current locale
wchar_t is meant to store a code point
char is meant to store a multibyte encoded form (one constraint, for instance, is that characters in the basic character set must be encoded in one byte)
string literals are encoded in an implementation-defined manner. If they use characters outside of the basic character set, you can't assume they are valid in all locales.
Thus, with a 16-bit wchar_t you are restricted to the BMP. Using the surrogates of UTF-16 is not compliant, but I think MS and IBM are more or less forced to do this because they believed Unicode when it said it would forever be a 16-bit charset. Those who delayed their Unicode support tend to use a 32-bit wchar_t.
Newer standards don't change much. Mostly there are literals for UTF-8, UTF-16 and UTF-32 encoded strings, and there are types for 16-bit and 32-bit characters. There is little or no additional support for Unicode in the standard libraries.
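For reference, a minimal illustration of those additions (note that in C++20 the type of a u8 literal changes to const char8_t*):
// C++11 Unicode string literals and character types
const char*     s8  = u8"caf\u00E9";  // UTF-8 encoded (const char8_t* in C++20)
const char16_t* s16 = u"caf\u00E9";   // UTF-16 encoded
const char32_t* s32 = U"caf\u00E9";   // UTF-32 encoded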
How to do the transformation of one encoding to the other
You have to be in a locale which uses Unicode. Hopefully
std::locale::global(std::locale(""));
will be enough for that. If not, your environment is not properly set up (or it is set up for another charset, and assuming Unicode would not be a service to your user).
C Style
Use the wcstombs and mbstowcs functions. Here is an example of what you asked for:
std::string narrow(std::wstring const& s)
{
    std::vector<char> result(4*s.size() + 1);
    size_t used = wcstombs(&result[0], s.data(), result.size());
    assert(used < result.size());
    return result.data();
}
C++ Style
The codecvt facet of the locale provides the needed functionality. The advantage is that you don't have to change the global locale to use it. The inconvenience is that the usage is more complex.
#include <locale>
#include <iostream>
#include <string>
#include <vector>
#include <assert.h>
#include <iomanip>
std::string narrow(std::wstring const& s,
                   std::locale loc = std::locale())
{
    std::vector<char> result(4*s.size() + 1);
    wchar_t const* fromNext;
    char* toNext;
    std::mbstate_t state = std::mbstate_t();
    std::codecvt_base::result convResult
        = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t> >(loc)
          .out(state, &s[0], &s[s.size()], fromNext,
               &result[0], &result[0] + result.size(), toNext);
    assert(fromNext == &s[s.size()]);
    assert(toNext != &result[0] + result.size());
    assert(convResult == std::codecvt_base::ok);
    *toNext = '\0';
    return &result[0];
}

std::wstring widen(std::string const& s,
                   std::locale loc = std::locale())
{
    std::vector<wchar_t> result(s.size() + 1);
    char const* fromNext;
    wchar_t* toNext;
    std::mbstate_t state = std::mbstate_t();
    std::codecvt_base::result convResult
        = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t> >(loc)
          .in(state, &s[0], &s[s.size()], fromNext,
              &result[0], &result[0] + result.size(), toNext);
    assert(fromNext == &s[s.size()]);
    assert(toNext != &result[0] + result.size());
    assert(convResult == std::codecvt_base::ok);
    *toNext = L'\0';
    return &result[0];
}
You should replace the assertions with better error handling.
BTW, this is standard C++ and doesn't assume Unicode, except for the computation of the size of result (you can do better by checking convResult, which can indicate a partial conversion).
The easiest way is to grab a small library, such as UTF8 CPP and do something like:
utf8::utf8to16(line.begin(), line.end(), back_inserter(utf16line));
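A slightly fuller sketch of that call (assuming the single-header utfcpp library, included as "utf8.h"):
#include <iterator>
#include <string>
#include <vector>
#include "utf8.h"   // UTF8 CPP (utfcpp) single-header library

int main()
{
    std::string line = "ABC";                  // UTF-8 input (plain ASCII here)
    std::vector<unsigned short> utf16line;     // 16-bit code units
    utf8::utf8to16(line.begin(), line.end(), std::back_inserter(utf16line));
    // utf16line now holds 0x0041, 0x0042, 0x0043
    return 0;
}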
I usually use the UnicodeConverter class from the Poco C++ libraries. If you don't want the dependency then you can have a look at the code.
Here's what I am trying to do:
typedef uint16_t uchar16_t;
uchar16_t buf[32];
// buf will contain timezone information like GMT-6, Eastern Daylight Time, etc
char * str = "Test";
for (int i = 0; i <= strlen(str); i++)
buf[i] = str[i];
I guess that's not correct, since uchar16_t contains 2 bytes and each char in str is 1 byte.
What is it that I am supposed to do?
Strlen? buf[32]? Trying to destroy the universe?
You want to use a wstringstream.
std::wstringstream lols;
lols << "Test";
std::wstring cakes;
lols >> cakes;
Edit (in response to a comment):
You shouldn't use strlen because any decent string system allows embedded zeros, and strlen is seriously slow. In addition, you didn't resize your buffer as needed, so if you had a string of size > 31 you would get a buffer overflow. Furthermore, if you did dynamically size your buffer, you would have to manually free it afterwards. These are serious failings of the C string system. My example code makes your standard library writer do all the work and avoids all these problems for you.
That's actually OK if your string will always be ASCII. To do it correctly, the portable function is mbstowcs, which assumes you're converting from the default locale; or, if you're on Windows, there are API functions that let you specify the source code page explicitly.
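A minimal sketch of that portable route (the helper name widenViaLocale is made up; on Windows you could call MultiByteToWideChar with an explicit code page instead):
#include <clocale>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

std::wstring widenViaLocale(const char* str)
{
    std::setlocale(LC_CTYPE, "");                       // use the environment's locale
    std::vector<wchar_t> wide(std::strlen(str) + 1);    // at most one wide char per input byte
    size_t n = std::mbstowcs(wide.data(), str, wide.size());
    if (n == (size_t)-1)
        return L"";                                      // invalid multibyte sequence
    return std::wstring(wide.data(), n);
}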
Your code will work, as long as str is ASCII; calling strlen() in the loop condition is probably a bad idea, though. It might be easier to just use swprintf() if it's available on your system:
wchar_t buf[32];            // swprintf() writes wchar_t, not uchar16_t
const char *str = "Test";
swprintf(buf, sizeof buf / sizeof *buf, L"%s", str);  // size is in wide characters; with C99 semantics %s here takes a narrow (multibyte) string
Have a look here.
Also, is there a good reason you are defining your own type?
If you have a (narrow) char string, you cannot convert it to a wchar_t string by setting your locale to "C" and then passing the string through mbstowcs(). That's because the "C" locale specifies a particular character encoding, and that encoding might not match the encoding of the execution character set, so mbstowcs() might map the characters to something unexpected, or could even fail (if the execution character set happened to use encodings that were incompatible with the encoding structure for the C locale character set).
Thus, in order to convert a char string into a wider string, you have to copy the chars one by one into an array of wchar_t. If you need to work with Unicode or UTF-16 or whatever after that, then wcstombs() is what you should look at.
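A minimal sketch of that char-by-char copy (the name widenAscii is made up; the result is only meaningful if the input is plain ASCII):
#include <string>

std::wstring widenAscii(const std::string& in)
{
    std::wstring out;
    out.reserve(in.size());
    for (unsigned char c : in)
        out.push_back(static_cast<wchar_t>(c));  // zero-extend each byte
    return out;
}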