How to convert Unicode code points to UTF-8 in C++?

I have an array consisting of Unicode code points:
unsigned short array[3] = {0x20ac, 0x20ab, 0x20ac};
I want to convert these to UTF-8 so I can write them into a file byte by byte using C++.
Example:
0x20ac should be converted to e2 82 ac.
Alternatively, is there any other method that can directly write Unicode characters to a file?

Finally! With C++11!
#include <string>
#include <locale>
#include <codecvt>
#include <cassert>
int main()
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    std::string u8str = converter.to_bytes(0x20ac);
    assert(u8str == "\xe2\x82\xac");
}
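Note that std::codecvt_utf8 and std::wstring_convert were deprecated in C++17, so it can be worth knowing how the conversion works by hand. Below is a minimal hand-rolled encoder for single code points (a sketch, not production code; the name encode_utf8 is mine, and it does not reject unpaired surrogates):

#include <string>

// Encode one Unicode code point as UTF-8, following the standard
// bit-distribution table: 1 byte below U+0080, 2 below U+0800,
// 3 below U+10000, otherwise 4.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

For the question's example, encode_utf8(0x20ac) yields the three bytes "\xe2\x82\xac".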

The term Unicode refers to a standard for the encoding and handling of text. It encompasses encodings like UTF-8, UTF-16, UTF-32, and UCS-2, among others.
I guess you are programming in a Windows environment, where Unicode typically refers to UTF-16.
When working with Unicode in C++, I would recommend the ICU library.
If you are programming on Windows, don't want to use an external library, and have no constraints regarding platform dependencies, you can use WideCharToMultiByte.
Example for ICU:
#include <iostream>
#include <unicode/ustream.h>

using icu::UnicodeString;

int main(int, char**) {
    // Convert from UTF-16 to UTF-8
    std::wstring utf16 = L"foobar";
    UnicodeString str(utf16.c_str());
    std::string utf8;
    str.toUTF8String(utf8);
    std::cout << utf8 << std::endl;
}
To do exactly what you want:
// Assuming you have ICU\include in your include path
// and ICU\lib(64) in your library path.
#include <iostream>
#include <fstream>
#include <unicode/ustream.h>

#pragma comment(lib, "icuio.lib")
#pragma comment(lib, "icuuc.lib")

void writeUtf16ToUtf8File(char const* fileName, wchar_t const* arr, size_t arrSize) {
    UnicodeString str(arr, arrSize);
    std::string utf8;
    str.toUTF8String(utf8);
    std::ofstream out(fileName, std::ofstream::binary);
    out << utf8;
    out.close();
}
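Hypothetical usage with the array from the question (this assumes Windows, where wchar_t is 16 bits and the values are treated as UTF-16 code units):

int main() {
    wchar_t arr[3] = { 0x20ac, 0x20ab, 0x20ac };
    writeUtf16ToUtf8File("out.txt", arr, 3);
}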

The following code may help you:
#include <atlconv.h>
#include <atlstr.h>

#define ASSERT ATLASSERT

int main()
{
    const CStringW unicode1 = L"\x0391 and \x03A9"; // 'Alpha' and 'Omega'
    const CStringA utf8 = CW2A(unicode1, CP_UTF8);
    ASSERT(utf8.GetLength() > unicode1.GetLength());
    const CStringW unicode2 = CA2W(utf8, CP_UTF8);
    ASSERT(unicode1 == unicode2);
}

This code uses WideCharToMultiByte (I assume that you are using Windows):
wchar_t wide_str[3] = {0x20ac, 0x20ab, 0x20ac}; // wchar_t, because the function expects an LPCWSTR
int utf8_size = WideCharToMultiByte(CP_UTF8, 0, wide_str, 3, NULL, 0, NULL, NULL) + 1;
char* utf8_str = (char*)calloc(utf8_size, 1);
WideCharToMultiByte(CP_UTF8, 0, wide_str, 3, utf8_str, utf8_size, NULL, NULL);
You need to call it twice: the first call returns the number of output bytes, and the second call actually converts. If you already know the output buffer size, you may skip the first call. Alternatively, you can simply allocate a buffer 3x the number of input code units plus 1 byte; a single UTF-16 code unit never expands to more than 3 UTF-8 bytes (for your case that means 9+1 bytes), so that is always enough.

With standard C++:
#include <iostream>
#include <locale>
#include <cwchar>
#include <vector>

int main()
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Convert;
    std::wstring w = L"\u20ac\u20ab\u20ac";
    std::locale locale("en_GB.utf8");
    const Convert& convert = std::use_facet<Convert>(locale);

    std::mbstate_t state{}; // value-initialize: an indeterminate state is not valid
    const wchar_t* from_ptr;
    char* to_ptr;
    std::vector<char> result(3 * w.size() + 1, 0);
    Convert::result convert_result = convert.out(state,
        w.c_str(), w.c_str() + w.size(), from_ptr,
        result.data(), result.data() + result.size(), to_ptr);

    if (convert_result == Convert::ok)
        std::cout << result.data() << std::endl;
    else
        std::cout << "Failure: " << convert_result << std::endl;
}

Iconv is a popular library used on many platforms.
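A minimal iconv sketch for the question's data, converting UTF-16LE to UTF-8 (error handling is reduced to a bare minimum; real code should inspect errno for E2BIG, EILSEQ, and EINVAL, and note that on some platforms the second parameter of iconv is declared const char**, which may require an extra cast):

#include <iconv.h>
#include <stdexcept>
#include <string>

std::string utf16le_to_utf8(const unsigned short* data, size_t count)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // iconv operates on raw byte buffers and advances the pointers as it goes.
    char* in = reinterpret_cast<char*>(const_cast<unsigned short*>(data));
    size_t in_left = count * sizeof(unsigned short);
    std::string out(count * 4, '\0'); // 4 output bytes per UTF-16 code unit is always enough
    char* out_ptr = &out[0];
    size_t out_left = out.size();

    size_t rc = iconv(cd, &in, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed");

    out.resize(out.size() - out_left);
    return out;
}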

I had a similar but slightly different problem: I had strings containing Unicode code points as literal escape sequences, e.g. "F\u00f3\u00f3 B\u00e1r", and I needed to convert those escape sequences to the Unicode characters they name.
Here is my C# solution
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static void Main(string[] args)
{
    Regex CodePoint = new Regex(@"\\u(?<UTF32>....)");
    Match Letter;
    string s = "F\\u00f3\\u00f3 B\\u00e1r"; // literal \u sequences
    string utf32;

    Letter = CodePoint.Match(s);
    while (Letter.Success)
    {
        utf32 = Letter.Groups[1].Value;
        if (Int32.TryParse(utf32, NumberStyles.HexNumber, CultureInfo.GetCultureInfoByIetfLanguageTag("en-US"), out int HexNum))
            s = s.Replace("\\u" + utf32, Char.ConvertFromUtf32(HexNum));
        Letter = Letter.NextMatch();
    }
    Console.WriteLine(s);
}
Output: Fóó Bár

Related

How can I convert wstring to u16string?

I want to convert wstring to u16string in C++.
I can convert wstring to string, or the reverse, but I don't know how to convert to u16string.
u16string CTextConverter::convertWstring2U16(wstring str)
{
    int iSize;
    u16string szDest[256] = {};
    memset(szDest, 0, 256);
    iSize = WideCharToMultiByte(CP_UTF8, NULL, str.c_str(), -1, NULL, 0, 0, 0);
    WideCharToMultiByte(CP_UTF8, NULL, str.c_str(), -1, szDest, iSize, 0, 0);
    u16string s16 = szDest;
    return s16;
}
The error is in the szDest argument of WideCharToMultiByte(CP_UTF8, NULL, str.c_str(), -1, szDest, iSize, 0, 0), because a u16string can't be used as an LPSTR.
How can I fix this code?
For a platform-independent solution see this answer.
If you need a solution only for the Windows platform, the following code will be sufficient:
std::wstring wstr( L"foo" );
std::u16string u16str( wstr.begin(), wstr.end() );
On the Windows platform, a std::wstring is interchangeable with std::u16string because sizeof(wstring::value_type) == sizeof(u16string::value_type) and both are UTF-16 (little endian) encoded.
wstring::value_type = wchar_t
u16string::value_type = char16_t
The only difference is that wchar_t and char16_t are distinct types (and their signedness may differ; the signedness of wchar_t is implementation-defined), so an element-wise conversion is needed. It can be performed by using the u16string constructor that takes an iterator pair as arguments; this constructor implicitly converts each wchar_t to char16_t.
Full example console application:
#include <windows.h>
#include <string>

int main()
{
    static_assert( sizeof(std::wstring::value_type) == sizeof(std::u16string::value_type),
        "std::wstring and std::u16string are expected to have the same character size" );

    std::wstring wstr( L"foo" );
    std::u16string u16str( wstr.begin(), wstr.end() );

    // The u16string constructor performs an implicit conversion like:
    wchar_t wch = L'A';
    char16_t ch16 = wch;

    // Need to reinterpret_cast because char16_t const* is not implicitly convertible
    // to LPCWSTR (aka wchar_t const*).
    ::MessageBoxW( 0, reinterpret_cast<LPCWSTR>( u16str.c_str() ), L"test", 0 );

    return 0;
}
Update
I had thought the standard version did not work, but in fact this was simply due to bugs in the Visual C++ and libstdc++ 3.4.21 runtime libraries. It does work with clang++ -std=c++14 -stdlib=libc++. Here is a version that tests whether the standard method works on your compiler:
#include <codecvt>
#include <cstdlib>
#include <cstring>
#include <cwctype>
#include <iostream>
#include <locale>
#include <clocale>
#include <vector>

using std::cout;
using std::endl;
using std::exit;
using std::memcmp;
using std::size_t;
using std::wcout;

#if _WIN32 || _WIN64
// Windows needs a little non-standard magic for this to work.
#include <io.h>
#include <fcntl.h>
#include <locale.h>
#endif

void init_locale(void)
// Does magic so that wcout can work.
{
#if _WIN32 || _WIN64
    // Windows needs a little non-standard magic.
    constexpr char cp_utf16le[] = ".1200";
    setlocale( LC_ALL, cp_utf16le );
    _setmode( _fileno(stdout), _O_U16TEXT );
#else
    // The correct locale name may vary by OS, e.g., "en_US.utf8".
    constexpr char locale_name[] = "";
    std::locale::global(std::locale(locale_name));
    std::wcout.imbue(std::locale());
#endif
}

int main(void)
{
    constexpr char16_t msg_utf16[] = u"¡Hola, mundo! \U0001F600"; // Shouldn't assume endianness.
    constexpr wchar_t msg_w[] = L"¡Hola, mundo! \U0001F600";
    constexpr char32_t msg_utf32[] = U"¡Hola, mundo! \U0001F600";
    constexpr char msg_utf8[] = u8"¡Hola, mundo! \U0001F600";

    init_locale();

    const std::codecvt_utf16<wchar_t, 0x1FFFF, std::little_endian> converter_w;
    const size_t max_len = sizeof(msg_utf16);
    std::vector<char> out(max_len);
    std::mbstate_t state{}; // value-initialize: an indeterminate state is not valid
    const wchar_t* from_w = nullptr;
    char* to_next = nullptr;

    converter_w.out( state, msg_w, msg_w + sizeof(msg_w)/sizeof(wchar_t), from_w,
                     out.data(), out.data() + out.size(), to_next );
    if ( memcmp( msg_utf8, out.data(), sizeof(msg_utf8) ) == 0 ) {
        wcout << L"std::codecvt_utf16<wchar_t> converts to UTF-8, not UTF-16!" << endl;
    } else if ( memcmp( msg_utf16, out.data(), max_len ) != 0 ) {
        wcout << L"std::codecvt_utf16<wchar_t> conversion not equal!" << endl;
    } else {
        wcout << L"std::codecvt_utf16<wchar_t> conversion is correct." << endl;
    }

    out.clear();
    out.resize(max_len);
    const std::codecvt_utf16<char32_t, 0x1FFFF, std::little_endian> converter_u32;
    const char32_t* from_u32 = nullptr;

    converter_u32.out( state, msg_utf32, msg_utf32 + sizeof(msg_utf32)/sizeof(char32_t), from_u32,
                       out.data(), out.data() + out.size(), to_next );
    if ( memcmp( msg_utf16, out.data(), max_len ) != 0 ) {
        wcout << L"std::codecvt_utf16<char32_t> conversion not equal!" << endl;
    } else {
        wcout << L"std::codecvt_utf16<char32_t> conversion is correct." << endl;
    }

    wcout << msg_w << endl;
    return EXIT_SUCCESS;
}
Previous
A bit late to the game, but here's a version that additionally checks whether wchar_t is 32 bits (as it is on Linux) and, if so, performs surrogate-pair conversion. I recommend saving this source as UTF-8 with a BOM.
#include <cassert>
#include <cwctype>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <locale>
#include <string>

#if _WIN32 || _WIN64
// Windows needs a little non-standard magic for this to work.
#include <io.h>
#include <fcntl.h>
#include <locale.h>
#endif

using std::size_t;

void init_locale(void)
// Does magic so that wcout can work.
{
#if _WIN32 || _WIN64
    // Windows needs a little non-standard magic.
    constexpr char cp_utf16le[] = ".1200";
    setlocale( LC_ALL, cp_utf16le );
    _setmode( _fileno(stdout), _O_U16TEXT );
#else
    // The correct locale name may vary by OS, e.g., "en_US.utf8".
    constexpr char locale_name[] = "";
    std::locale::global(std::locale(locale_name));
    std::wcout.imbue(std::locale());
#endif
}

std::u16string make_u16string( const std::wstring& ws )
/* Creates a UTF-16 string from a wide-character string. Any wide characters
 * outside the allowed range of UTF-16 are mapped to the sentinel value U+FFFD,
 * per the Unicode documentation. (http://www.unicode.org/faq/private_use.html
 * retrieved 12 March 2017.) Unpaired surrogates in ws are also converted to
 * sentinel values. Noncharacters, however, are left intact. As a fallback,
 * if wide characters are the same size as char16_t, this does a more trivial
 * construction using that implicit conversion.
 */
{
    /* We assume that, if this test passes, a wide-character string is already
     * UTF-16, or at least converts to it implicitly without needing surrogate
     * pairs.
     */
    if ( sizeof(wchar_t) == sizeof(char16_t) ) {
        return std::u16string( ws.begin(), ws.end() );
    } else {
        /* The conversion from UTF-32 to UTF-16 might possibly require surrogates.
         * A surrogate pair suffices to represent all wide characters, because all
         * characters outside the range will be mapped to the sentinel value
         * U+FFFD. Add one character for the terminating NUL.
         */
        const size_t max_len = 2 * ws.length() + 1;
        // Our temporary UTF-16 string.
        std::u16string result;
        result.reserve(max_len);

        for ( const wchar_t& wc : ws ) {
            const std::wint_t chr = wc;
            if ( chr < 0 || chr > 0x10FFFF || (chr >= 0xD800 && chr <= 0xDFFF) ) {
                // Invalid code point. Replace with sentinel, per Unicode standard:
                constexpr char16_t sentinel = u'\uFFFD';
                result.push_back(sentinel);
            } else if ( chr < 0x10000UL ) { // In the BMP.
                result.push_back(static_cast<char16_t>(wc));
            } else {
                const char16_t leading = static_cast<char16_t>(
                    ((chr-0x10000UL) / 0x400U) + 0xD800U );
                const char16_t trailing = static_cast<char16_t>(
                    ((chr-0x10000UL) % 0x400U) + 0xDC00U );
                result.append({leading, trailing});
            } // end if
        } // end for

        /* The returned string is shrunken to fit, which might not be the Right
         * Thing if there is more to be added to the string.
         */
        result.shrink_to_fit();

        // We depend here on the compiler to optimize the move constructor.
        return result;
    } // end if
    // Not reached.
}

int main(void)
{
    static const std::wstring wtest(L"☪☮∈✡℩☯✝ \U0001F644");
    static const std::u16string u16test(u"☪☮∈✡℩☯✝ \U0001F644");

    const std::u16string converted = make_u16string(wtest);

    init_locale();
    std::wcout << L"sizeof(wchar_t) == " << sizeof(wchar_t) << L".\n";

    for( size_t i = 0; i <= u16test.length(); ++i ) {
        if ( u16test[i] != converted[i] ) {
            std::wcout << std::hex << std::showbase
                       << std::right << std::setfill(L'0')
                       << std::setw(4) << (unsigned)converted[i] << L" ≠ "
                       << std::setw(4) << (unsigned)u16test[i] << L" at "
                       << i << L'.' << std::endl;
            return EXIT_FAILURE;
        } // end if
    } // end for

    std::wcout << wtest << std::endl;
    return EXIT_SUCCESS;
}
Footnote
Since someone asked: The reason I suggest UTF-8 with BOM is that some compilers, including MSVC 2015, will assume a source file is encoded according to the current code page unless there is a BOM or you specify an encoding on the command line. No encoding works on all toolchains, unfortunately, but every tool I’ve used that’s modern enough to support C++14 also understands the BOM.
- To convert CString to std::wstring and std::string:
string CString2string(CString str)
{
    int bufLen = WideCharToMultiByte(CP_UTF8, 0, (LPCTSTR)str, -1, NULL, 0, NULL, NULL);
    char *buf = new char[bufLen];
    WideCharToMultiByte(CP_UTF8, 0, (LPCTSTR)str, -1, buf, bufLen, NULL, NULL);
    string sRet(buf);
    delete[] buf;
    return sRet;
}

CString strFileName = "test.txt";
wstring wFileName(strFileName.GetBuffer());
strFileName.ReleaseBuffer();
string sFileName = CString2string(strFileName);
- To convert std::string to CString:
CString string2CString(string s)
{
    int bufLen = MultiByteToWideChar(CP_ACP, 0, s.c_str(), -1, NULL, 0);
    WCHAR *buf = new WCHAR[bufLen];
    MultiByteToWideChar(CP_ACP, 0, s.c_str(), -1, buf, bufLen);
    CString strRet(buf);
    delete[] buf;
    return strRet;
}

string sFileName = "test.txt";
CString strFileName = string2CString(sFileName);

How do I convert a string to a wstring using the value of the string?

I'm new to C++ and I have this issue: I have a string called DATA_DIR that I need to format into a wstring.
string str = DATA_DIR;
std::wstring temp(L"%s", str);
Visual Studio tells me that there is no instance of the constructor that matches the argument list. Clearly, I'm doing something wrong.
I found this example online
std::wstring someText( L"hello world!" );
which apparently works (no compile errors). My question is, how do I get the string value stored in DATA_DIR into the wstring constructor as opposed to something arbitrary like "hello world"?
Here is an implementation using wcstombs (Updated):
#include <iostream>
#include <cstdlib>
#include <string>
std::string wstring_from_bytes(std::wstring const& wstr)
{
    // Worst case: each wide character converts to at most MB_CUR_MAX narrow bytes.
    std::size_t size = wstr.size() * MB_CUR_MAX + 1;
    char *str = new char[size];
    std::string temp;
    std::wcstombs(str, wstr.c_str(), size);
    temp = str;
    delete[] str;
    return temp;
}

int main()
{
    std::wstring wstr = L"abcd";
    std::string str = wstring_from_bytes(wstr);
}
This is in reference to the most up-voted answer but I don't have enough "reputation" to just comment directly on the answer.
The name of the function in that solution, "wstring_from_bytes", implies it is doing what the original poster wants (getting a wstring given a string), but the function actually does the opposite and would more accurately be named "bytes_from_wstring".
To convert from string to wstring, the wstring_from_bytes function should use mbstowcs, not wcstombs:
#define _CRT_SECURE_NO_WARNINGS
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>

std::wstring wstring_from_bytes(std::string const& str)
{
    size_t requiredSize = 0;
    std::wstring answer;
    wchar_t *pWTempString = NULL;

    /*
     * Call the conversion function without the output buffer to get the required size
     * - Add one to leave room for the NULL terminator
     */
    requiredSize = mbstowcs(NULL, str.c_str(), 0) + 1;

    /* Allocate the output string (Add one to leave room for the NULL terminator) */
    pWTempString = (wchar_t *)malloc( requiredSize * sizeof( wchar_t ));
    if (pWTempString == NULL)
    {
        printf("Memory allocation failure.\n");
    }
    else
    {
        // Call the conversion function with the output buffer
        size_t size = mbstowcs( pWTempString, str.c_str(), requiredSize);
        if (size == (size_t) (-1))
        {
            printf("Couldn't convert string\n");
        }
        else
        {
            answer = pWTempString;
        }
    }

    if (pWTempString != NULL)
    {
        free(pWTempString); // allocated with malloc, so free it (not delete[])
    }

    return answer;
}

int main()
{
    std::string str = "abcd";
    std::wstring wstr = wstring_from_bytes(str);
}
Regardless, this is much more easily done in newer versions of the standard library (C++11 and newer):
#include <locale>
#include <codecvt>
#include <string>
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::wstring wide = converter.from_bytes(narrow_utf8_source_string);
printf-style format specifiers are not part of the C++ library and cannot be used to construct a string.
If the string may only contain single-byte characters, then the range constructor is sufficient.
std::string narrower( "hello" );
std::wstring wider( narrower.begin(), narrower.end() );
The problem is that we usually use wstring when wide characters are applicable (hence the w), which are represented in std::string by multibyte sequences. Doing this causes each byte of a multibyte sequence to translate to a separate, incorrect wide character.
Moreover, converting a multibyte sequence requires knowing its encoding, and this information is not encapsulated by std::string or std::wstring. C++11 allows you to specify an encoding and translate using std::wstring_convert, but I'm not sure how widely supported it is as of yet. See 0x....'s excellent answer.
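To make that pitfall concrete, a small sketch: constructing a wstring from the UTF-8 bytes of "á" produces two wide characters rather than one:

#include <cassert>
#include <string>

int main() {
    std::string utf8 = "\xC3\xA1"; // UTF-8 encoding of U+00E1 ("á")
    // The range constructor widens each *byte* separately (possibly
    // sign-extended), yielding U+00C3 U+00A1 ("Ã¡"), not U+00E1.
    std::wstring wide(utf8.begin(), utf8.end());
    assert(wide.size() == 2);
    assert((wide[0] & 0xFF) == 0xC3 && (wide[1] & 0xFF) == 0xA1);
}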
The wstring_convert approach mentioned above for C++11 was deprecated in C++17; for this specific conversion the deprecation notice suggests using the MultiByteToWideChar function instead.
The compiler error (C4996) mentions defining _SILENCE_CXX17_CODECVT_HEADER_DEPRECATION_WARNING.
wstring temp = L"";
for (auto c : DATA_DIR)
temp.push_back(c);
I found this function. Could not find any predefined method to do this.
std::wstring s2ws(const std::string& s)
{
    int len;
    int slength = (int)s.length() + 1;
    len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
    wchar_t* buf = new wchar_t[len];
    MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
    std::wstring r(buf);
    delete[] buf;
    return r;
}

std::wstring stemp = s2ws(myString);

How to convert from UTF-8 to ANSI using standard C++

I have some strings read from the database, stored in a char* and in UTF-8 format (you know, "á" is encoded as 0xC3 0xA1). But, in order to write them to a file, I first need to convert them to ANSI (I can't make the file in UTF-8 format... it's only read as ANSI), so that my "á" doesn't become "á". Yes, I know some data will be lost (Chinese characters, and in general anything not in the ANSI code page), but that's exactly what I need.
The thing is, I need the code to compile on various platforms, so it has to be standard C++ (i.e., no WinAPI; only the standard library, STL, CRT, or any custom library with available source).
Does anyone have any suggestions?
A few days ago, somebody answered that if I had a C++11 compiler, I could try this:
#include <string>
#include <codecvt>
#include <locale>
#include <vector>

using namespace std;

string utf8_to_string(const char *utf8str, const locale& loc)
{
    // UTF-8 to wstring
    wstring_convert<codecvt_utf8<wchar_t>> wconv;
    wstring wstr = wconv.from_bytes(utf8str);
    // wstring to string
    vector<char> buf(wstr.size());
    use_facet<ctype<wchar_t>>(loc).narrow(wstr.data(), wstr.data() + wstr.size(), '?', buf.data());
    return string(buf.data(), buf.size());
}

int main(int argc, char* argv[])
{
    string ansi;
    char utf8txt[] = "\xc3\xa1"; // a string literal avoids narrowing errors from {0xc3, 0xa1, 0}
    // I guess you want to use Windows-1252 encoding...
    ansi = utf8_to_string(utf8txt, locale(".1252"));
    // Now do something with the string
    return 0;
}
I don't know what happened to that response; apparently someone deleted it. But it turns out to be the perfect solution. To whoever posted it: thanks a lot, and you deserve the accepted answer and an upvote!
If you mean ASCII, just discard any byte that has bit 7 set; this will remove all multibyte sequences. Note that you could create more advanced algorithms, like removing the accent from the "á", but that would require much more work.
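A sketch of that approach (every byte of a UTF-8 multibyte sequence, lead or continuation, has bit 7 set, so removing those bytes leaves pure ASCII):

#include <algorithm>
#include <string>

std::string strip_non_ascii(std::string s)
{
    // Drop any byte with the high bit set; only 7-bit ASCII survives.
    s.erase(std::remove_if(s.begin(), s.end(),
                           [](unsigned char c) { return (c & 0x80) != 0; }),
            s.end());
    return s;
}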
This should work:
#include <string>
#include <codecvt>
#include <locale>

using namespace std::string_literals;

std::string to_utf8(const std::string& str, const std::locale& loc = std::locale{}) {
    using wcvt = std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>;
    std::u32string wstr(str.size(), U'\0');
    std::use_facet<std::ctype<char32_t>>(loc).widen(str.data(), str.data() + str.size(), &wstr[0]);
    return wcvt{}.to_bytes(wstr.data(), wstr.data() + wstr.size());
}

std::string from_utf8(const std::string& str, const std::locale& loc = std::locale{}) {
    using wcvt = std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>;
    auto wstr = wcvt{}.from_bytes(str);
    std::string result(wstr.size(), '\0');
    std::use_facet<std::ctype<char32_t>>(loc).narrow(wstr.data(), wstr.data() + wstr.size(), '?', &result[0]);
    return result;
}

int main() {
    auto s0 = u8"Blöde C++ Scheiße äöü!!1Elf"s;
    auto s1 = from_utf8(s0);
    auto s2 = to_utf8(s1);
    return 0;
}
For VC++:
#include <string>
#include <codecvt>
#include <locale>

using namespace std::string_literals;

std::string to_utf8(const std::string& str, const std::locale& loc = std::locale{}) {
    using wcvt = std::wstring_convert<std::codecvt_utf8<int32_t>, int32_t>;
    std::u32string wstr(str.size(), U'\0');
    std::use_facet<std::ctype<char32_t>>(loc).widen(str.data(), str.data() + str.size(), &wstr[0]);
    return wcvt{}.to_bytes(
        reinterpret_cast<const int32_t*>(wstr.data()),
        reinterpret_cast<const int32_t*>(wstr.data() + wstr.size())
    );
}

std::string from_utf8(const std::string& str, const std::locale& loc = std::locale{}) {
    using wcvt = std::wstring_convert<std::codecvt_utf8<int32_t>, int32_t>;
    auto wstr = wcvt{}.from_bytes(str);
    std::string result(wstr.size(), '\0');
    std::use_facet<std::ctype<char32_t>>(loc).narrow(
        reinterpret_cast<const char32_t*>(wstr.data()),
        reinterpret_cast<const char32_t*>(wstr.data() + wstr.size()),
        '?', &result[0]);
    return result;
}

int main() {
    auto s0 = u8"Blöde C++ Scheiße äöü!!1Elf"s;
    auto s1 = from_utf8(s0);
    auto s2 = to_utf8(s1);
    return 0;
}

Convert WCHAR[260] to std::string

I have gotten a WCHAR[MAX_PATH] from (PROCESSENTRY32) pe32.szExeFile on Windows. The following do not work:
std::string s;
s = pe32.szExeFile; // compile error; casting to (const char*) doesn't work either
and
std::string s;
char DefChar = ' ';
WideCharToMultiByte(CP_ACP, 0, pe32.szExeFile, -1, ch, 260, &DefChar, NULL);
s = pe32.szExeFile;
For your first example you can just do:
std::wstring s(pe32.szExeFile);
and for your second:
char DefChar = ' ';
WideCharToMultiByte(CP_ACP, 0, pe32.szExeFile, -1, ch, 260, &DefChar, NULL);
std::wstring s(pe32.szExeFile);
as std::wstring has a wchar_t* ctor
Your call to WideCharToMultiByte looks correct, provided ch is a sufficiently large buffer. After that, however, you want to assign the buffer (ch) to the string (or use it to construct a string), not pe32.szExeFile.
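In other words, a corrected sketch (assuming ch is a char buffer of at least 260 bytes, as in the question):

char ch[260];
char DefChar = ' ';
WideCharToMultiByte(CP_ACP, 0, pe32.szExeFile, -1, ch, 260, &DefChar, NULL);
std::string s(ch); // construct the narrow string from the converted buffer, not from szExeFile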
There are convenient conversion classes from ATL; you may want to use some of them, e.g.:
std::string s( CW2A(pe32.szExeFile) );
Note however that a conversion from Unicode UTF-16 to ANSI can be lossy. If you want a non-lossy conversion, you could convert from UTF-16 to UTF-8 and store the UTF-8 inside the std::string.
If you don't want to use ATL, there are some convenient freely available C++ wrappers around raw Win32 WideCharToMultiByte to convert from UTF-16 to UTF-8 using STL strings.
#ifndef __STRINGCAST_H__
#define __STRINGCAST_H__

#include <windows.h>
#include <vector>
#include <string>
#include <cstring>
#include <cwchar>
#include <cassert>

template<typename Td>
Td string_cast(const wchar_t* pSource, unsigned int codePage = CP_ACP);

#endif // __STRINGCAST_H__

template<>
std::string string_cast( const wchar_t* pSource, unsigned int codePage )
{
    assert(pSource != 0);
    size_t sourceLength = std::wcslen(pSource);
    if(sourceLength > 0)
    {
        int length = ::WideCharToMultiByte(codePage, 0, pSource, static_cast<int>(sourceLength), NULL, 0, NULL, NULL);
        if(length == 0)
            return std::string();

        std::vector<char> buffer( length );
        ::WideCharToMultiByte(codePage, 0, pSource, static_cast<int>(sourceLength), &buffer[0], length, NULL, NULL);

        return std::string(buffer.begin(), buffer.end());
    }
    else
        return std::string();
}
and use this template as follows:
PWSTR CurWorkDir;
std::string CurWorkLogFile;
CurWorkDir = new WCHAR[length];
CurWorkLogFile = string_cast<std::string>(CurWorkDir);
....
delete [] CurWorkDir;

Need help replacing swprintf

I am using some cross-platform stuff called Nutcracker to go between Windows and Linux; to make a long story short, it's limited in its support for wide-string chars. I have to take the code below and replace what the swprintf is doing, and I have no idea how. My experience with low-level byte manipulation sucks. Can someone please help me with this?
Please keep in mind I can't go crazy and rewrite swprintf, just get the basic functionality to format pwszString correctly from the data in pBuffer. This is C++ using the Microsoft VC6.0 compiler, but through CXX, so it's limited as well.
The wszSep is just a delimiter, either "" or "-", for readability when printing.
HRESULT BufferHelper::Buff2StrASCII(
    /*[in]*/ const unsigned char * pBuffer,
    /*[in]*/ int iSize,
    /*[in]*/ LPWSTR wszSep,
    /*[out]*/ LPWSTR* pwszString )
{
    // Check args
    if (! pwszString) return E_POINTER;

    // Allocate memory
    int iSep = (int)wcslen(wszSep);
    *pwszString = new WCHAR [ (((iSize * ( 2 + iSep )) + 1 ) - iSep ) ];
    if (! *pwszString) return E_OUTOFMEMORY; // check the allocation, not the out-pointer

    // Loop
    int i = 0;
    for (i=0; i< iSize; i++)
    {
        swprintf( (*pwszString)+(i*(2+iSep)), L"%02X%s", pBuffer[i], (i!=(iSize-1)) ? wszSep : L"" );
    }
    return S_OK;
}
This takes what's in pBuffer and encodes it into the wide buffer as ASCII hex. I use typedef const unsigned short* LPCWSTR; because that type does not exist in Nutcracker.
I can post more if you need to see more code.
Thanks.
It is a bit hard to understand exactly what you are looking for, so I've guessed.
As the tag was "C++", not "C", I have converted it to work in a more "C++" way. I don't have a Linux box to try this on, but I think it will probably compile OK.
Your description of the input data sounded like UTF-16 wide characters, so I've used a std::wstring for the input buffer. If that is wrong, change it to a std::vector of unsigned chars and adjust the formatting statement accordingly.
#include <string>
#include <vector>
#include <cerrno>
#include <iostream>
#include <iomanip>
#include <sstream>

#if !defined(S_OK)
#define S_OK 0
#define E_OUTOFMEMORY ENOMEM
#endif

unsigned Buff2StrASCII(
    const std::wstring &sIn,
    const std::wstring &sSep,
    std::wstring &sOut)
{
    try
    {
        std::wostringstream os;
        for (size_t i = 0; i < sIn.size(); i++)
        {
            if (i)
                os << sSep;
            os << std::setw(4) << std::setfill(L'0') << std::hex
               << static_cast<unsigned>(sIn[i]);
        }
        sOut = os.str();
        return S_OK;
    }
    catch (std::bad_alloc &)
    {
        return E_OUTOFMEMORY;
    }
}

int main(int argc, char *argv[])
{
    wchar_t szIn[] = L"The quick brown fox";
    std::wstring sOut;

    Buff2StrASCII(szIn, L" - ", sOut);

    std::wcout << sOut << std::endl;

    return 0;
}
Why would you use the WSTR types at all in Nutcracker / the Linux build? Most Unixes and Linux use UTF-8 for their filesystem representation, so in the non-Windows builds you use sprintf and char*s.
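For that narrow-character route, a sketch of the same hex formatting using std::string and snprintf (_snprintf on old MSVC); the function name is mine, not Nutcracker's:

#include <cstdio>
#include <string>

// Hex-format each input byte, two digits per byte, separated by sep.
std::string Buff2StrNarrow(const unsigned char* pBuffer, int iSize, const char* sep)
{
    std::string out;
    char hex[3]; // "%02X" writes exactly two digits plus the terminator
    for (int i = 0; i < iSize; ++i) {
        std::snprintf(hex, sizeof hex, "%02X", pBuffer[i]);
        out += hex;
        if (i != iSize - 1)
            out += sep;
    }
    return out;
}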