Converting "normal" std::string to UTF-8 - C++

Let's see if I can explain this without too many factual errors...
I'm writing a string class and I want it to use UTF-8 (stored in a std::string) as its internal storage.
I want it to be able to take both "normal" std::string and std::wstring as input and output.
Working with std::wstring is not a problem; I can use std::codecvt_utf8<wchar_t> to convert both from and to std::wstring.
However, after extensive googling and searching on SO, I have yet to find a way to convert between a "normal/default" C++ std::string (which I assume on Windows uses the local system code page?) and a UTF-8 std::string.
I guess one option would be to first convert the std::string to a std::wstring using std::codecvt<wchar_t, char> and then convert that to UTF-8 as above, but this seems quite inefficient given that at least the first 128 values of a char should translate straight over to UTF-8 without conversion, regardless of locale, if I understand correctly.
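For reference, the std::wstring leg I mentioned looks roughly like this (a sketch using std::wstring_convert; note that with a 16-bit wchar_t, std::codecvt_utf8_utf16 would be the surrogate-aware choice, while std::codecvt_utf8 only handles the BMP):

#include <codecvt>
#include <locale>
#include <string>

// Sketch: the wstring <-> UTF-8 conversions via std::codecvt_utf8.
std::string wide_to_utf8(const std::wstring& w)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(w);
}

std::wstring utf8_to_wide(const std::string& s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(s);
}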
I found this similar question: C++: how to convert ASCII or ANSI to UTF8 and stores in std::string
Although I'm a bit skeptical towards that answer, as it's hard-coded to Latin-1, and I want this to work with all locales to be on the safe side.
No answers involving boost thanks, I don't want the headache of getting my codebase to work with it.

If your "normal string" is encoded using the system's code page and you want to convert it to UTF-8, then this should work:
std::string codepage_str;
// First call: ask how many UTF-16 code units the converted string needs.
int size = MultiByteToWideChar(CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                               static_cast<int>(codepage_str.length()), nullptr, 0);
std::wstring utf16_str(size, L'\0');
// Second call: perform the actual code page -> UTF-16 conversion.
MultiByteToWideChar(CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                    static_cast<int>(codepage_str.length()), &utf16_str[0], size);

// Same two-call pattern for UTF-16 -> UTF-8.
int utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
                                    static_cast<int>(utf16_str.length()), nullptr, 0,
                                    nullptr, nullptr);
std::string utf8_str(utf8_size, '\0');
WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
                    static_cast<int>(utf16_str.length()), &utf8_str[0], utf8_size,
                    nullptr, nullptr);

Related

Convert ICU Unicode string to std::wstring (or wchar_t*)

Is there an ICU function to create a std::wstring from an icu::UnicodeString? I have been searching the ICU manual but haven't been able to find one.
(I know I can convert UnicodeString to UTF-8 and then convert to a platform-dependent wchar_t*, but I am looking for one function in UnicodeString which can do this conversion.)
The C++ standard doesn't dictate any specific encoding for std::wstring. On Windows systems, wchar_t is 16-bit, and on Linux, macOS, and several other platforms, wchar_t is 32-bit. As far as C++'s std::wstring is concerned, it is just an arbitrary sequence of wchar_t in much the same way that std::string is just an arbitrary sequence of char.
It seems that icu::UnicodeString has no in-built way of creating a std::wstring, but if you really want to create a std::wstring anyway, you can use the C-based API u_strToWCS() like this:
icu::UnicodeString ustr = /* get from somewhere */;
std::wstring wstr;
int32_t requiredSize = 0;
UErrorCode error = U_ZERO_ERROR;
// Obtain the size of string we need; this "pre-flighting" call sets
// U_BUFFER_OVERFLOW_ERROR, which is expected here.
u_strToWCS(nullptr, 0, &requiredSize, ustr.getBuffer(), ustr.length(), &error);
// Resize accordingly (this will not include any terminating null character,
// but it doesn't need to either).
wstr.resize(requiredSize);
// Reset the error code, or the next call will bail out immediately.
error = U_ZERO_ERROR;
// Copy the UnicodeString buffer to the std::wstring. (&wstr[0] is used
// because non-const std::wstring::data() only exists since C++17.)
u_strToWCS(&wstr[0], wstr.size(), nullptr, ustr.getBuffer(), ustr.length(), &error);
Supposedly, u_strToWCS() will use the most efficient method for converting from UChar to wchar_t (if they are the same size, then it is presumably just a straightforward copy).
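If you only ever target platforms where the sizes match (e.g. Windows, where both UChar and wchar_t are 16-bit), a direct copy of the buffer is possible without u_strToWCS at all. A sketch, with a static_assert guarding the assumption:

#include <unicode/unistr.h>
#include <string>

std::wstring to_wstring_direct(const icu::UnicodeString& ustr)
{
    // Only valid where UChar and wchar_t are the same size; this fails
    // to compile on Linux/macOS, where wchar_t is 32-bit.
    static_assert(sizeof(wchar_t) == sizeof(UChar),
                  "direct copy requires a 16-bit wchar_t");
    return std::wstring(reinterpret_cast<const wchar_t*>(ustr.getBuffer()),
                        ustr.length());
}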

PathFileExistsA fails for UTF-8?

I'm working with VS2017 and need to support UTF-8 paths due to an SDK that only supports UTF-8. Within my code, I'd like to test whether a UTF-8 path is valid, so I am using
PathFileExistsA( path );
but it fails for a path I know is valid. (It passes if "path" has only ASCII characters -- no characters requiring multi-byte UTF-8 sequences.)
I realized the "A" in PathFileExistsA stands for ANSI, but does that exclude UTF-8? Its counterpart is PathFileExistsW, but I can't use wide chars.
All I'm after is a test to determine whether a UTF-8 path is valid, so I can use another function if more suitable.
Windows natively uses UTF-16 (wide chars) for its APIs. If you can't use wide chars for your input, then you can accept it as UTF-8 and convert it to wide chars using MultiByteToWideChar() and then call the wide char version of the API function:
char* lpUtf8 = ...;
// Look up the size of the wide string
int wideSize = ::MultiByteToWideChar( CP_UTF8, 0, lpUtf8, -1, nullptr, 0 );
// Allocate the string
wchar_t* lpWideString = new wchar_t[wideSize];
// Do the conversion
::MultiByteToWideChar( CP_UTF8, 0, lpUtf8, -1, lpWideString, wideSize );
// Call the wide function
::PathFileExistsW( lpWideString );
// Deallocate the string
delete[] lpWideString;
It's much cleaner if you use the STL string functions. This article is very good:
C++ - Unicode Encoding Conversions with STL Strings and Win32 APIs
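For the record, a sketch of the same conversion with std::wstring managing the allocation (utf8_to_utf16 is a hypothetical helper name, not something from the article):

#include <string>
#include <windows.h>

std::wstring utf8_to_utf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    // First call computes the length, second call converts.
    int size = ::MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(),
                                     static_cast<int>(utf8.length()),
                                     nullptr, 0);
    std::wstring wide(size, L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(),
                          static_cast<int>(utf8.length()),
                          &wide[0], size);
    return wide;
}

// Usage (PathFileExistsW is declared in <shlwapi.h>):
// BOOL exists = ::PathFileExistsW(utf8_to_utf16(path).c_str());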

C++ append int to wstring

Before (using ASCII) I was using std::string as a buffer like this:
std::string test = "";
int value = 6;
test.append("some string");
test.append((char*)&value, 4);
test.append("some string");
with the expected value in test:
"some string\x06\x00\x00\x00some string"
Now I am trying to use Unicode, and I want to keep the same code, but trouble happens:
std::wstring test = L"";
int value = 6;
test.append(L"some string");
test.append((wchar_t*)&value, 4); // buffer overflow: 4 wchar_t = 8 bytes read from a 4-byte int
test.append(L"some string");
How can I append bytes like in std::string?
Doing:
std::wstring test = L"";
int value = 6;
test.append(L"some string");
test.append((wchar_t*)&value, 2);
test.append(L"some string");
partially solves the problem, because afterwards I can't append bools.
EDIT:
I could even use a wstringstream, if it applied a binary copy (normally it does not).
You're confusing Unicode and character encodings. An std::string can represent Unicode code points just fine, using the UTF-8 encoding.
Windows uses the UTF-16LE encoding (or UTF-16 with a BOM, I believe) to represent Unicode text. Most other platforms use UTF-8.
An std::string which is encoded in UTF-8 and which uses only ASCII characters is byte-for-byte identical to the corresponding ASCII string. This is the beauty of UTF-8: it's a natural extension.
Anyway,
I need a "binary" dynamic buffer, where I can add the real size of types (bool 1, int 4, etc.)
An std::vector<uint8_t> is probably more suitable for this task. It communicates that it is not something human-readable, per se. If you need to embed strings into this buffer, make sure that sizeof(char) == sizeof(uint8_t) on the platform, and then just write the data as-is to this buffer.
If you're saving this buffer on one machine and try to read it on another machine, you have to take care of endianness too.
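A sketch of that idea, using a hypothetical append_raw helper (the overload set and names are mine, not from the answer):

#include <cstdint>
#include <string>
#include <vector>

// Append the raw bytes of any trivially copyable value. Note this stores
// host byte order; pick an explicit order if the buffer crosses machines.
template <typename T>
void append_raw(std::vector<uint8_t>& buf, const T& value)
{
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&value);
    buf.insert(buf.end(), p, p + sizeof(T));
}

// Strings are written as-is, one byte per char.
void append_raw(std::vector<uint8_t>& buf, const std::string& s)
{
    buf.insert(buf.end(), s.begin(), s.end());
}

// Usage mirroring the question: ints take 4 bytes, bools take 1.
// std::vector<uint8_t> buf;
// append_raw(buf, std::string("some string"));
// append_raw(buf, 6);     // 4 bytes
// append_raw(buf, true);  // 1 byte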
You can make a function that appends the bytes you want to put:
// Widens each input byte to one wchar_t, so the byte count is preserved
// (at the cost of storing two bytes per input byte).
void putBytes(std::wstring& s, char* c, int numBytes)
{
    while (numBytes-- > 0)
        s += (wchar_t)*c++;
}
Then you can call it:
int value = 65;
putBytes(s, reinterpret_cast<char*>(&value), sizeof(value));
I think an IStream is the proper way to do this... I'll make an interface to handle different types. I was abusing std::string as an easy "dynamic binary array"; with std::wstring this is not possible, for many reasons, but the silliest one is that each element requires at least 2 bytes, so there is no room for a bool.

Given a buffer (void*) and an encoding name, how can I find a string or character's index?

I have a void* to a buffer (and its length, in bytes). I also know the text encoding of the buffer, it might be plain ASCII, UTF-8, Shift-JIS, UTF-16 (both LE or BE), or something else. And I need to find out if "<<<" appears in the buffer.
I'm using C++11 (VS2013), and coming from a .NET background the problem wouldn't be that hard: create an Encoding instance for the encoding, pass it the data (as a Byte[]) to convert to a String instance (internally UTF-16LE), and use the string functions IndexOf or a Regex.
However, the C++ equivalent doesn't necessarily work, as Microsoft's C++ runtime library, specifically its implementation of locale, does not support UTF-8 as a multi-byte encoding (a complete mystery to me). I'm also not keen on the prospect of performing so many buffer copies (like how in .NET you would Marshal.Copy the void* into a Byte[], and then again into a String via the Encoding instance).
Here's the pseudo-code high-level logic I'm after:
const char* needle = "<<<";

size_t IndexOf(void* buffer, size_t bufferCB, std::string encodingName, std::string needle) {
    char* needleEncoded = Convert( needle, encodingName );
    setlocale( encodingName );
    size_t index = strstr( buffer, bufferCB, needleEncoded );
    return index;
}
My versions of setlocale and strstr aren't in the standard, but exemplify the kind of behaviour I need.
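One way to realize that logic without touching locales: convert the needle into the buffer's encoding, then do a plain byte search. A sketch for just the UTF-16LE case on Windows (each other encoding would need its own needle conversion; names here are hypothetical):

#include <algorithm>
#include <cstdint>
#include <string>
#include <windows.h>

// Returns the byte offset of the needle in a UTF-16LE buffer, or SIZE_MAX.
size_t IndexOfUtf16LE(const void* buffer, size_t bufferCB, const std::string& needle)
{
    // Encode the needle as UTF-16 (little-endian in Windows memory).
    int len = ::MultiByteToWideChar(CP_UTF8, 0, needle.c_str(),
                                    static_cast<int>(needle.size()), nullptr, 0);
    std::wstring wneedle(len, L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, needle.c_str(),
                          static_cast<int>(needle.size()), &wneedle[0], len);

    const uint8_t* first = static_cast<const uint8_t*>(buffer);
    const uint8_t* last = first + bufferCB;
    const uint8_t* nfirst = reinterpret_cast<const uint8_t*>(wneedle.data());
    const uint8_t* nlast = nfirst + wneedle.size() * sizeof(wchar_t);

    // NB: a real implementation should reject matches at odd byte offsets,
    // which cannot be real UTF-16 code unit boundaries.
    const uint8_t* it = std::search(first, last, nfirst, nlast);
    return it == last ? SIZE_MAX : static_cast<size_t>(it - first);
}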

How to convert wstring into byte vector

Hi, I have a few typedefs:
typedef unsigned char Byte;
typedef std::vector<Byte> ByteVector;
typedef std::wstring String;
I need to convert String into ByteVector, I have tried this:
String str = L"123";
ByteVector vect(str.begin(), str.end());
As a result, the vector contains 3 elements: 1, 2, 3. However, it is a wstring, so every character in this string is wide, and my expected result would be: 0, 1, 0, 2, 0, 3.
Is there any standard way to do that, or do I need to write some custom function?
// Reinterpret the wchar_t array as raw bytes (host byte order).
Byte const* p = reinterpret_cast<Byte const*>(&str[0]);
std::size_t size = str.size() * sizeof(str.front());
ByteVector vect(p, p + size);
What is your actual goal? If you just want to get the bytes representing the wchar_t objects, a fairly trivial conversion will do the trick, although I wouldn't use just a cast to unsigned char const* but rather an explicit conversion.
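A sketch of what that explicit conversion might look like, splitting each wchar_t into bytes with a defined order (high byte first here, to match the 0, 1, 0, 2, 0, 3 layout expected in the question):

#include <cstddef>
#include <string>
#include <vector>

typedef unsigned char Byte;
typedef std::vector<Byte> ByteVector;
typedef std::wstring String;

ByteVector to_bytes(String const& str)
{
    ByteVector result;
    result.reserve(str.size() * sizeof(wchar_t));
    for (std::size_t i = 0; i != str.size(); ++i) {
        // Emit the bytes of each wchar_t most-significant first,
        // independent of the host's endianness.
        for (std::size_t b = sizeof(wchar_t); b-- > 0; ) {
            result.push_back(static_cast<Byte>((str[i] >> (8 * b)) & 0xFF));
        }
    }
    return result;
}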
On the other hand, if you actually want to convert the std::wstring into a sequence encoded using e.g. UTF8 or UTF16 as is usually the case when dealing with characters, the conversion used for the encoding becomes significantly more complex. Probably the easiest approach to convert to an encoding is to use C's wcstombs():
std::vector<char> target(source.size() * 4);
size_t n = wcstombs(&target[0], &source[0], target.size());
The above fragment assumes that source isn't empty and that source is null-terminated (std::wstring guarantees a terminating wchar_t() after its last element since C++11). The conversion uses C's global locale and converts to whatever character encoding is set up there. There is also a version, wcstombs_l(), where you can specify the locale.
C++ has similar functionality, but it is a bit harder to use: the std::codecvt<...> facet. I can provide an example if necessary.
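For completeness, a sketch of what driving the std::codecvt facet directly might look like (error handling of the codecvt_base::result return value is omitted; an assumption-laden sketch, not tested production code):

#include <locale>
#include <string>
#include <vector>

// Convert a wide string to the narrow encoding of the given locale.
std::vector<char> wide_to_narrow(std::wstring const& source, std::locale const& loc)
{
    if (source.empty()) return std::vector<char>();
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    Facet const& facet = std::use_facet<Facet>(loc);
    std::mbstate_t state = std::mbstate_t();
    // Worst case: every wchar_t expands to max_length() narrow chars.
    std::vector<char> target(source.size() * facet.max_length());
    wchar_t const* from_next = 0;
    char* to_next = 0;
    facet.out(state,
              source.data(), source.data() + source.size(), from_next,
              &target[0], &target[0] + target.size(), to_next);
    target.resize(to_next - &target[0]);
    return target;
}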