Convert C++ std::string to UTF-16-LE encoded string

I've been searching for hours today and just can't find anything that works for me. The most recent thing I looked at, with no luck, was "How to convert UTF-8 encoded std::string to UTF-16 std::string".
My question, with a brief explanation:
I want to compute a valid NTLM hash in standard C++, and I'm using OpenSSL's MD4 routines to create the hash. I know how to do that part, so does anyone know how to convert a std::string into a UTF-16-LE encoded string that I can pass to the MD4 functions to get a correct digest?
In other words, can I take a std::string holding plain char data and convert it to a UTF-16-LE encoded, variable-length string type, whether that is std::u16string or std::wstring?
And would I use s.c_str() or s.data(), and would length() report the size correctly in either case?

I think something like this should do the trick:
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

std::string utf16_to_utf8(std::u16string const& s)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t, 0x10ffff,
        std::codecvt_mode::little_endian>, char16_t> cnv;
    std::string utf8 = cnv.to_bytes(s);
    if (cnv.converted() < s.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

std::u16string utf8_to_utf16(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t, 0x10ffff,
        std::codecvt_mode::little_endian>, char16_t> cnv;
    std::u16string s = cnv.from_bytes(utf8);
    if (cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return s;
}
Note that std::wstring_convert is deprecated in C++17, but I still favor it over a non-standard library, given that it is portable, has no dependencies, and will no doubt remain available until it is replaced.
And, if all else fails, you can reimplement these same functions with alternative code without changing any other part of the application.
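For the NTLM use case specifically, here is a minimal sketch of how the conversion above might feed OpenSSL's MD4. It assumes a little-endian host, so the char16_t units of the std::u16string are already laid out in UTF-16-LE byte order; on a big-endian host the bytes would need swapping first. It also assumes the legacy one-shot MD4() API from <openssl/md4.h> is available (it is deprecated as of OpenSSL 3.0).
#include <openssl/md4.h>

#include <cstdint>
#include <string>
#include <vector>

// Sketch: NTLM hash = MD4 over the UTF-16-LE bytes of the password.
// Assumes a little-endian host so the char16_t units already have
// UTF-16-LE byte order.
std::vector<std::uint8_t> ntlm_hash(const std::string& utf8_password)
{
    std::u16string utf16 = utf8_to_utf16(utf8_password); // function from above

    const auto* bytes = reinterpret_cast<const unsigned char*>(utf16.data());
    const std::size_t byte_len = utf16.size() * sizeof(char16_t);

    std::vector<std::uint8_t> digest(MD4_DIGEST_LENGTH);
    MD4(bytes, byte_len, digest.data());
    return digest;
}
This also answers the c_str()/data() part of the question: for std::u16string (and std::string) in C++11 and later, data() and c_str() point at the same contiguous buffer, and size()/length() report the number of char16_t units, so the byte count to hash is size() * sizeof(char16_t).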

Apologies up front... this will be an ugly reply with some long code. I ended up using the following function, effectively compiling iconv into my Windows application file by file :)
Hope this helps.
#include <errno.h>
#include <iconv.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <string>

// Converts 'in' (in the current locale's encoding) to UTF-16LE.
// Returns a calloc'd buffer (the caller frees it) or 0 on failure;
// *used_len receives the number of output bytes written.
char* conver(const char* in, size_t in_len, size_t* used_len)
{
    const int CC_MUL = 2; // 16 bit
    setlocale(LC_ALL, "");
    char* t1 = setlocale(LC_CTYPE, "");
    char* locn = (char*)calloc(strlen(t1) + 1, sizeof(char));
    if (locn == NULL)
    {
        return 0;
    }
    strcpy(locn, t1);
    const char* enc = strchr(locn, '.') + 1;
#if _WINDOWS
    std::string win = "WINDOWS-";
    win += enc;
    enc = win.c_str();
#endif
    iconv_t foo = iconv_open("UTF-16LE", enc);
    if (foo == (iconv_t)-1)
    {
        if (errno == EINVAL)
            fprintf(stderr, "Conversion from %s is not supported\n", enc);
        else
            fprintf(stderr, "Initialization failure:\n");
        free(locn);
        return 0;
    }
    size_t out_len = CC_MUL * in_len;
    size_t saved_in_len = in_len;
    iconv(foo, NULL, NULL, NULL, NULL); // reset the conversion state
    char* converted = (char*)calloc(out_len, sizeof(char));
    char* converted_start = converted;
    char* t = const_cast<char*>(in);
    size_t ret = iconv(foo,
                       &t,
                       &in_len,
                       &converted,
                       &out_len);
    iconv_close(foo);
    *used_len = CC_MUL * saved_in_len - out_len;
    if (ret == (size_t)-1)
    {
        switch (errno)
        {
        case EILSEQ:
            fprintf(stderr, "EILSEQ\n");
            break;
        case EINVAL:
            fprintf(stderr, "EINVAL\n");
            break;
        }
        perror("iconv");
        free(locn);
        free(converted_start);
        return 0;
    }
    free(locn);
    return converted_start;
}
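A quick usage sketch for the function above (names and the ownership convention, caller frees the returned buffer, are as in the code above; the printed byte count is just for illustration):
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main()
{
    const char* password = "secret";
    size_t out_bytes = 0;
    char* utf16le = conver(password, strlen(password), &out_bytes);
    if (utf16le == 0)
    {
        fprintf(stderr, "conversion failed\n");
        return 1;
    }
    // 'utf16le' now holds out_bytes bytes of UTF-16LE data, ready to be
    // fed to e.g. OpenSSL's MD4 for the NTLM hash.
    printf("converted %zu bytes\n", out_bytes);
    free(utf16le);
    return 0;
}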

Related

C++17 UTF8 std::string to std::wstring UTF32 using unicode.org code or C++ standard functions?

I'm looking for a working solution to the classic UTF-8 to UTF-32 conversion, in a stable and tested form.
Now I have the source to Unicode.org's
C code:
https://android.googlesource.com/platform/external/id3lib/+/master/unicode.org/ConvertUTF.c
https://android.googlesource.com/platform/external/id3lib/+/master/unicode.org/ConvertUTF.h
License:
https://android.googlesource.com/platform/external/id3lib/+/master/unicode.org/readme.txt
I'm using the following C++, which interfaces with the C library code above:
std::wstring Utf8_To_wstring(const std::string& utf8string)
{
    if (utf8string.length() == 0)
    {
        return std::wstring();
    }
    size_t widesize = utf8string.length();
    if (sizeof(wchar_t) == 2)
    {
        std::wstring resultstring;
        resultstring.resize(widesize, L'\0');
        const UTF8* sourcestart = reinterpret_cast<const UTF8*>(utf8string.c_str());
        const UTF8* sourceend = sourcestart + widesize;
        UTF16* targetstart = reinterpret_cast<UTF16*>(&resultstring[0]);
        UTF16* targetend = targetstart + widesize;
        ConversionResult res = ConvertUTF8toUTF16(&sourcestart, sourceend, &targetstart, targetend, strictConversion);
        if (res != conversionOK)
        {
            return std::wstring(utf8string.begin(), utf8string.end());
        }
        *targetstart = 0;
        return std::wstring(resultstring.c_str());
    }
    else if (sizeof(wchar_t) == 4)
    {
        std::wstring resultstring;
        resultstring.resize(widesize, L'\0');
        const UTF8* sourcestart = reinterpret_cast<const UTF8*>(utf8string.c_str());
        const UTF8* sourceend = sourcestart + widesize;
        UTF32* targetstart = reinterpret_cast<UTF32*>(&resultstring[0]);
        UTF32* targetend = targetstart + widesize;
        ConversionResult res = ConvertUTF8toUTF32(&sourcestart, sourceend, &targetstart, targetend, lenientConversion);
        if (res != conversionOK)
        {
            return std::wstring(utf8string.begin(), utf8string.end());
        }
        *targetstart = 0;
        if (!resultstring.empty() && resultstring.size() > 0)
        {
            std::wstring result = std::wstring(resultstring.c_str());
            return result;
        }
        else
        {
            return std::wstring();
        }
    }
    else
    {
        assert(false);
        return L"";
    }
    return L"";
}
This code initially works but crashes soon after, due to some issue in the interfacing code above. The interfacing code was adapted from open-source code found on GitHub, from a production project...
It crashes a few strings into the conversion, so I guess there's an overflow somewhere in this code.
Does anyone have a good replacement or example code for a simple C++11/C++17 solution that converts a std::string to a std::wstring holding UTF-32 code points?
I have a working solution for UTF-8 to UTF-16 using the C++17 locale facilities:
This seems to do the job for me, converting to the right level of Unicode so I can extract character codes as ints and load glyph codes correctly.
#include <locale>
#include <codecvt>
#include <stdexcept>
#include <string>

std::wstring Utf8_To_wstring(const std::string& utf8string)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    std::wstring utf16;
    try
    {
        utf16 = converter.from_bytes(utf8string);
    }
    catch (const std::range_error& e)
    {
        // log / handle the exception
    }
    return utf16;
}
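For the UTF-32 case asked about above, here is a minimal sketch using the same (deprecated but still available) <codecvt> machinery: std::codecvt_utf8<char32_t> converts between UTF-8 and UTF-32, so a std::u32string can be produced directly instead of going through wchar_t. This is an assumption-level sketch, not the question's original code.
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Sketch: UTF-8 -> UTF-32 via the deprecated-but-portable <codecvt> facet.
std::u32string Utf8_To_Utf32(const std::string& utf8string)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    std::u32string utf32;
    try
    {
        utf32 = converter.from_bytes(utf8string);
    }
    catch (const std::range_error&)
    {
        // log / handle invalid UTF-8 input
    }
    return utf32; // each char32_t is one Unicode code point, e.g. usable as a glyph lookup key
}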

Convert between wstring and string, got different results with the "same" approach

I use a function s2ws() (found on SO; if you see something wrong with it, please let me know) to convert from string to wstring, then I use tinyxml2 to read some data from XML. As we all know, parts of the tinyxml2 interface take char* as input, and the return values are char* as well.
The reason for converting from string to wstring is that the whole project uses wchar_t types to deal with strings.
/*
 * string converts to wstring
 */
std::wstring s2ws(const std::string& src)
{
    std::wstring res = L"";
    size_t const wcs_len = mbstowcs(NULL, src.c_str(), 0);
    std::vector<wchar_t> buffer(wcs_len + 1);
    mbstowcs(&buffer[0], src.c_str(), src.size());
    res.assign(buffer.begin(), buffer.end() - 1);
    return res;
}

/*
 * wstring converts to string
 */
std::string ws2s(const std::wstring& src)
{
    setlocale(LC_CTYPE, "");
    std::string res = "";
    size_t const mbs_len = wcstombs(NULL, src.c_str(), 0);
    std::vector<char> buffer(mbs_len + 1);
    wcstombs(&buffer[0], src.c_str(), buffer.size());
    res.assign(buffer.begin(), buffer.end() - 1);
    return res;
}
ClassES->Attribute() returns char*, and the function s2ws converts string to wstring. The two approaches below produce different results in the map m_UpdateClassification. The second method is the one between #if 0 and #endif. I think the two approaches should make no difference.
The second method ends up with empty strings after the conversion, and I can't figure out why. If you have any clue, please let me know.
typedef std::map<std::wstring, std::wstring> CMapString;
CMapString m_UpdateClassification;
const wchar_t * First = NULL;
const wchar_t * Second = NULL;
const char *name = ClassES->Attribute( "name" );
const char *value = ClassES->Attribute( "value" );
std::wstring wname = s2ws(name);
std::wcout<< wname << std::endl;
First = wname.c_str();
std::wstring wvalue = s2ws(value);
std::wcout<< wvalue << std::endl;
Second = wvalue.c_str();
#if 0
First = s2ws(ClassES->Attribute( "name" )).c_str();
if( !First ) { m_ProdectFamily.clear(); return FALSE; }
Second = s2ws(ClassES->Attribute( "value" )).c_str();
if( !Second ) { m_ProdectFamily.clear(); return FALSE; }
#endif
m_UpdateClassification[Second] = First;
I think I found the reason: I was assigning a wchar_t* taken from a temporary to the pointer instead of keeping the wstring itself. After modifying the code like this, everything runs well.
std::wstring First = L"";
std::wstring Second = L"";
First = s2ws(ClassES->Attribute("name"));
if( First.empty() ) { m_ProdectFamily.clear(); return FALSE; }
Second = s2ws(ClassES->Attribute("value"));
if( Second.empty() ) { m_ProdectFamily.clear(); return FALSE; }
Another question: should I check the result of s2ws (mbstowcs) and ws2s (wcstombs)?
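For reference, here is a minimal sketch of what such checking might look like. mbstowcs and wcstombs return (size_t)-1 when they hit an invalid sequence, so even the length query can fail; the function name s2ws_checked is only illustrative.
#include <clocale>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Sketch: s2ws with the mbstowcs results checked before they are used.
std::wstring s2ws_checked(const std::string& src)
{
    std::setlocale(LC_CTYPE, "");
    size_t const wcs_len = std::mbstowcs(nullptr, src.c_str(), 0);
    if (wcs_len == static_cast<size_t>(-1))
        throw std::runtime_error("s2ws: invalid multibyte sequence");

    std::vector<wchar_t> buffer(wcs_len + 1);
    if (std::mbstowcs(buffer.data(), src.c_str(), buffer.size()) == static_cast<size_t>(-1))
        throw std::runtime_error("s2ws: conversion failed");

    return std::wstring(buffer.data(), wcs_len);
}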

How to read Unicode string from process in Windows?

I'm trying to read a Unicode string from another process's memory with this code:
Function:
bool ReadWideString(const HANDLE& hProc, const std::uintptr_t& addr, std::wstring& out) {
    std::array<wchar_t, maxStringLength> outStr;
    auto readMemRes = ReadProcessMemory(hProc, (LPCVOID)addr, (LPVOID)&out, sizeof(out), NULL);
    if (!readMemRes)
        return false;
    else {
        out = std::wstring(outStr.data());
    }
    return true;
}
Call:
std::wstring name;
bool res = ReadWideString(OpenedProcessHandle, address, name);
std::wofstream test("test.txt");
test << name;
test.close();
This works well with English letters, but when I try to read Cyrillic, it outputs nothing. I tried with std::string, but all I get is random junk like "EC9" instead of "Дебил".
I'm using Visual Studio 17 and the C++17 standard.
You can't read directly into the wstring the way you are doing. That will overwrite its internal data members and corrupt surrounding memory, which would be very bad.
You are allocating a local buffer, but you are not using it for anything. Use it, e.g.:
bool ReadWideString(HANDLE hProc, std::uintptr_t addr, std::wstring& out) {
    std::array<wchar_t, maxStringLength> outStr;
    SIZE_T numRead = 0;
    if (!ReadProcessMemory(hProc, reinterpret_cast<LPVOID>(addr), &outStr, sizeof(outStr), &numRead))
        return false;
    out.assign(outStr.data(), numRead / sizeof(wchar_t));
    return true;
}
std::wstring name;
if (ReadWideString(OpenedProcessHandle, address, name)) {
    std::ofstream test("test.txt", std::ios::binary);
    wchar_t bom = 0xFEFF;
    test.write(reinterpret_cast<char*>(&bom), sizeof(bom));
    test.write(reinterpret_cast<const char*>(name.c_str()), name.size() * sizeof(wchar_t));
}
Alternatively, get rid of the local buffer and preallocate the wstring's memory buffer instead, then you can read directly into it, eg:
bool ReadWideString(HANDLE hProc, std::uintptr_t addr, std::wstring& out) {
    out.resize(maxStringLength);
    SIZE_T numRead = 0;
    if (!ReadProcessMemory(hProc, reinterpret_cast<LPVOID>(addr), &out[0], maxStringLength * sizeof(wchar_t), &numRead)) {
        out.clear();
        return false;
    }
    out.resize(numRead / sizeof(wchar_t));
    return true;
}
Or
bool ReadWideString(HANDLE hProc, std::uintptr_t addr, std::wstring& out) {
    std::wstring outStr;
    outStr.resize(maxStringLength);
    SIZE_T numRead = 0;
    if (!ReadProcessMemory(hProc, reinterpret_cast<LPVOID>(addr), &outStr[0], maxStringLength * sizeof(wchar_t), &numRead))
        return false;
    outStr.resize(numRead / sizeof(wchar_t));
    out = std::move(outStr);
    return true;
}
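As a usage sketch (assuming maxStringLength and OpenedProcessHandle are defined as in the question), the wide string that was read could also be converted to UTF-8 before writing, so an ordinary text editor displays the Cyrillic correctly; this uses WideCharToMultiByte and is only one possible approach.
#include <windows.h>
#include <fstream>
#include <string>

// Sketch: convert the wide string read from the target process to UTF-8
// before writing it to a file.
std::string to_utf8(const std::wstring& ws) {
    if (ws.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, ws.data(), (int)ws.size(), nullptr, 0, nullptr, nullptr);
    std::string out(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, ws.data(), (int)ws.size(), &out[0], len, nullptr, nullptr);
    return out;
}

// Usage (names taken from the question):
// std::wstring name;
// if (ReadWideString(OpenedProcessHandle, address, name)) {
//     std::ofstream test("test.txt", std::ios::binary);
//     std::string utf8 = to_utf8(name);
//     test.write(utf8.data(), utf8.size());
// }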

String to LPCWSTR in c++

I'm trying to convert from string to LPCWSTR (my project uses the multi-byte character set).
1) For example:
LPCWSTR ToLPCWSTR(string text)
{
    LPCWSTR sw = (LPCWSTR)text.c_str();
    return sw;
}
2) This returns Chinese characters:
LPCWSTR ToLPCWSTR(string text)
{
    std::wstring stemp = std::wstring(text.begin(), text.end());
    LPCWSTR sw = (LPCWSTR)stemp.c_str();
    return sw;
}
However, both always show squares (see the screenshot).
EDITED:
My code, with an edit by Barmak Shemirani:
std::wstring get_utf16(const std::string &str, int codepage)
{
    if (str.empty()) return std::wstring();
    int sz = MultiByteToWideChar(codepage, 0, &str[0], (int)str.size(), 0, 0);
    std::wstring res(sz, 0);
    MultiByteToWideChar(codepage, 0, &str[0], (int)str.size(), &res[0], sz);
    return res;
}

string HttpsWebRequest(string domain, string url)
{
    LPCWSTR sdomain = get_utf16(domain, CP_UTF8).c_str();
    LPCWSTR surl = get_utf16(url, CP_UTF8).c_str();
    //(Some stuff...)
}
Return:
https://i.gyazo.com/ea4cd50765bfcbe12c763ea299e7b508.png
EDITED:
Using other code that converts from UTF-8 to UTF-16, I still get the same result.
std::wstring utf8_to_utf16(const std::string& utf8)
{
    std::vector<unsigned long> unicode;
    size_t i = 0;
    while (i < utf8.size())
    {
        unsigned long uni;
        size_t todo;
        bool error = false;
        unsigned char ch = utf8[i++];
        if (ch <= 0x7F)
        {
            uni = ch;
            todo = 0;
        }
        else if (ch <= 0xBF)
        {
            throw std::logic_error("not a UTF-8 string");
        }
        else if (ch <= 0xDF)
        {
            uni = ch & 0x1F;
            todo = 1;
        }
        else if (ch <= 0xEF)
        {
            uni = ch & 0x0F;
            todo = 2;
        }
        else if (ch <= 0xF7)
        {
            uni = ch & 0x07;
            todo = 3;
        }
        else
        {
            throw std::logic_error("not a UTF-8 string");
        }
        for (size_t j = 0; j < todo; ++j)
        {
            if (i == utf8.size())
                throw std::logic_error("not a UTF-8 string");
            unsigned char ch = utf8[i++];
            if (ch < 0x80 || ch > 0xBF)
                throw std::logic_error("not a UTF-8 string");
            uni <<= 6;
            uni += ch & 0x3F;
        }
        if (uni >= 0xD800 && uni <= 0xDFFF)
            throw std::logic_error("not a UTF-8 string");
        if (uni > 0x10FFFF)
            throw std::logic_error("not a UTF-8 string");
        unicode.push_back(uni);
    }
    std::wstring utf16;
    for (size_t i = 0; i < unicode.size(); ++i)
    {
        unsigned long uni = unicode[i];
        if (uni <= 0xFFFF)
        {
            utf16 += (wchar_t)uni;
        }
        else
        {
            uni -= 0x10000;
            utf16 += (wchar_t)((uni >> 10) + 0xD800);
            utf16 += (wchar_t)((uni & 0x3FF) + 0xDC00);
        }
    }
    return utf16;
}
You have two problems.
LPCWSTR is a pointer to const wchar_t, and std::string::c_str() returns a const char*. Those two types are different, so casting from const char* to LPCWSTR won't work.
The memory pointed to by the pointer returned by std::basic_string::c_str is owned by the string object, and is freed when the string goes out of scope.
You will need to allocate memory and make a copy of the string.
The easiest way to allocate memory for a new wide string would be to just return a std::wstring. You can then pass the pointer returned by c_str() to whatever API function takes LPCWSTR:
std::wstring string_to_wstring(const std::string& text) {
    return std::wstring(text.begin(), text.end());
}
If the std::string source is English or in some Latin-1 based languages, then conversion to std::wstring can be done with a simple copy (as shown in Miles Budnek's answer). But in general you have to use MultiByteToWideChar:
std::wstring get_utf16(const std::string &str, int codepage)
{
    if (str.empty()) return std::wstring();
    int sz = MultiByteToWideChar(codepage, 0, &str[0], (int)str.size(), 0, 0);
    std::wstring res(sz, 0);
    MultiByteToWideChar(codepage, 0, &str[0], (int)str.size(), &res[0], sz);
    return res;
}
You have to know the codepage used to create the source string. You can use GetACP() to find the codepage of the user's computer. If the source string is UTF-8, use CP_UTF8 as the codepage.
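Note that the wstring returned by get_utf16 must outlive any LPCWSTR taken from it. Here is a minimal sketch of the corrected call site from the question's HttpsWebRequest (only the lifetime handling differs from the question's version; the body is otherwise elided as in the original):
string HttpsWebRequest(string domain, string url)
{
    // Keep the converted strings alive in named variables; calling
    // .c_str() on a temporary leaves the pointer dangling immediately.
    std::wstring wdomain = get_utf16(domain, CP_UTF8);
    std::wstring wurl = get_utf16(url, CP_UTF8);
    LPCWSTR sdomain = wdomain.c_str();
    LPCWSTR surl = wurl.c_str();
    //(Some stuff...)
}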

application crashes at first strcat_s

I have tried both strcat and strcat_s, but they both crash. Does anyone know why this happens? I can't find the problem.
Crash: "Unhandled exception at 0x58636D2A (msvcr110d.dll)"
_Dst 0x00ea6b30 "C:\\Users\\Ruben\\Documents\\School\\" char *
_SizeInBytes 260 unsigned int
_Src 0x0032ef64 "CKV" const char *
available 228 unsigned int
p 0x00ea6b50 "" char *
Code:
#include <Windows.h>
#include <strsafe.h>

extern "C"
{
char* GetFilesInFolders(LPCWSTR filedir, char* path)
{
    char* files = "";
    char DefChar = ' ';
    char* Streepje = "-";
    bool LastPoint = false;
    WIN32_FIND_DATA ffd;
    TCHAR szDir[MAX_PATH];
    HANDLE hFind = INVALID_HANDLE_VALUE;
    DWORD dwError = 0;
    StringCchCopy(szDir, MAX_PATH, filedir);
    hFind = FindFirstFile(szDir, &ffd);
    if (INVALID_HANDLE_VALUE == hFind)
        return "";
    do
    {
        DWORD attributes = ffd.dwFileAttributes;
        LPCWSTR nm = ffd.cFileName;
        char name[260];
        WideCharToMultiByte(CP_ACP, 0, ffd.cFileName, -1, name, 260, &DefChar, NULL);
        for (int i = 0; i <= 260; i++)
        {
            if (name[i] == '.')
                LastPoint = true;
            else if (name[i] == ' ')
                break;
        }
        if (LastPoint == true)
        {
            LastPoint = false;
            continue;
        }
        if (attributes & FILE_ATTRIBUTE_HIDDEN)
        {
            continue;
        }
        else if (attributes & FILE_ATTRIBUTE_DIRECTORY)
        {
            char* newfiledir = "";
            char* newpath = path;
            char* add = "\\";
            char* extra = "*";
            strcat_s(newpath, sizeof(name), name);
            strcat_s(newpath, sizeof(add), add);
            puts(newpath);
            strcpy_s(newfiledir, sizeof(newpath) + 1, newpath);
            strcat_s(newfiledir, sizeof(extra) + 1, extra);
            puts(newfiledir);
            size_t origsize = strlen(newfiledir) + 1;
            const size_t newsize = 100;
            size_t convertedChars = 0;
            wchar_t wcstring[newsize];
            mbstowcs_s(&convertedChars, wcstring, origsize, newfiledir, _TRUNCATE);
            LPCWSTR dir = wcstring;
            GetFilesInFolders(dir, newpath);
        }
        else
        {
            char* file = path;
            strcat_s(file, sizeof(name), name);
            puts(file);
            strcat_s(files, sizeof(file), file);
            strcat_s(files, sizeof(Streepje), Streepje);
            puts(files);
        }
    }
    while (FindNextFile(hFind, &ffd) != 0);
    FindClose(hFind);
    return files;
}
}

int _tmain(int argc, _TCHAR* argv[])
{
    char* path = "C:\\Users\\Ruben\\Documents\\School\\";
    char* filedir = "C:\\Users\\Ruben\\Documents\\School\\*";
    size_t origsize = strlen(filedir) + 1;
    const size_t newsize = 100;
    size_t convertedChars = 0;
    wchar_t wcstring[newsize];
    mbstowcs_s(&convertedChars, wcstring, origsize, filedir, _TRUNCATE);
    LPCWSTR dir = wcstring;
    char* files = GetFilesInFolders(dir, path);
    return 0;
}
Extra info: I don't want to use Boost or std::string, and I want to keep this in Unicode (the default character set).
You assign a const char* to files, then attempt to append to it.
char* files = "";
// ...
strcat_s(files, sizeof(file), file);
You cannot modify a constant string literal.
I would recommend turning on compiler warnings and making sure to look at them. They would warn you about assigning a const char* to a char*. To fix that, you might have changed files to be const, which would then cause your strcpy_s to no longer compile.
It looks like you don't understand how variables are stored in memory or how pointers work. In your _tmain() you have char * path pointing to a constant string literal, which you pass into GetFilesInFolders(), where it gets modified. Compilers tend to allow char *s to point at constant strings for backward compatibility with old C programs. You cannot modify these. You cannot append to them. The compiler (generally) puts these in a read-only segment. That's one reason why you're getting an exception.
Your whole GetFilesInFolders() is wrong. And as DarkFalcon pointed out, you haven't allocated any space anywhere for files, you have it pointing to a constant string literal.
Get "The C++ Programming Language" and read chapter 5.