How to convert from UTF-8 to ANSI using standard c++

How to convert from UTF-8 to ANSI using standard c++ - c++

I have some strings read from the database, stored in a char* and in UTF-8 format (you know, "á" is encoded as 0xC3 0xA1). But, in order to write them to a file, I first need to convert them to ANSI (can't make the file in UTF-8 format... it's only read as ANSI), so that my "á" doesn't become "Ã¡". Yes, I know some data will be lost (chinese characters, and in general anything not in the ANSI code page) but that's exactly what I need.
But the thing is, I need the code to compile in various platforms, so it has to be standard C++ (i.e. no Winapi, only stdlib, stl, crt or any custom library with available source).
Anyone has any suggestions?

A few days ago, somebody answered that if I had a C++11 compiler, I could try this:
#include <string>
#include <codecvt>
#include <locale>
string utf8_to_string(const char *utf8str, const locale& loc)
{
// UTF-8 to wstring
wstring_convert<codecvt_utf8<wchar_t>> wconv;
wstring wstr = wconv.from_bytes(utf8str);
// wstring to string
vector<char> buf(wstr.size());
use_facet<ctype<wchar_t>>(loc).narrow(wstr.data(), wstr.data() + wstr.size(), '?', buf.data());
return string(buf.data(), buf.size());
}
int main(int argc, char* argv[])
{
string ansi;
char utf8txt[] = {0xc3, 0xa1, 0};
// I guess you want to use Windows-1252 encoding...
ansi = utf8_to_string(utf8txt, locale(".1252"));
// Now do something with the string
return 0;
}
Don't know what happened to the response, apparently someone deleted it. But, turns out that it is the perfect solution. To whoever posted, thanks a lot, and you deserve the AC and upvote!!

If you mean ASCII, just discard any byte that has bit 7 set, this will remove all multibyte sequences. Note that you could create more advanced algorithms, like removing the accent from the "á", but that would require much more work.

This should work:
#include <string>
#include <codecvt>
using namespace std::string_literals;
std::string to_utf8(const std::string& str, const std::locale& loc = std::locale{}) {
using wcvt = std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>;
std::u32string wstr(str.size(), U'\0');
std::use_facet<std::ctype<char32_t>>(loc).widen(str.data(), str.data() + str.size(), &wstr[0]);
return wcvt{}.to_bytes(wstr.data(),wstr.data() + wstr.size());
}
std::string from_utf8(const std::string& str, const std::locale& loc = std::locale{}) {
using wcvt = std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t>;
auto wstr = wcvt{}.from_bytes(str);
std::string result(wstr.size(), '0');
std::use_facet<std::ctype<char32_t>>(loc).narrow(wstr.data(), wstr.data() + wstr.size(), '?', &result[0]);
return result;
}
int main() {
auto s0 = u8"Blöde C++ Scheiße äöü!!1Elf"s;
auto s1 = from_utf8(s0);
auto s2 = to_utf8(s1);
return 0;
}
For VC++:
#include <string>
#include <codecvt>
using namespace std::string_literals;
std::string to_utf8(const std::string& str, const std::locale& loc = std::locale{}) {
using wcvt = std::wstring_convert<std::codecvt_utf8<int32_t>, int32_t>;
std::u32string wstr(str.size(), U'\0');
std::use_facet<std::ctype<char32_t>>(loc).widen(str.data(), str.data() + str.size(), &wstr[0]);
return wcvt{}.to_bytes(
reinterpret_cast<const int32_t*>(wstr.data()),
reinterpret_cast<const int32_t*>(wstr.data() + wstr.size())
);
}
std::string from_utf8(const std::string& str, const std::locale& loc = std::locale{}) {
using wcvt = std::wstring_convert<std::codecvt_utf8<int32_t>, int32_t>;
auto wstr = wcvt{}.from_bytes(str);
std::string result(wstr.size(), '0');
std::use_facet<std::ctype<char32_t>>(loc).narrow(
reinterpret_cast<const char32_t*>(wstr.data()),
reinterpret_cast<const char32_t*>(wstr.data() + wstr.size()),
'?', &result[0]);
return result;
}
int main() {
auto s0 = u8"Blöde C++ Scheiße äöü!!1Elf"s;
auto s1 = from_utf8(s0);
auto s2 = to_utf8(s1);
return 0;
}

#include <stdio.h>
#include <string>
#include <codecvt>
#include <locale>
#include <vector>
using namespace std;
std::string utf8_to_string(const char *utf8str, const locale& loc){
// UTF-8 to wstring
wstring_convert<codecvt_utf8<wchar_t>> wconv;
wstring wstr = wconv.from_bytes(utf8str);
// wstring to string
vector<char> buf(wstr.size());
use_facet<ctype<wchar_t>>(loc).narrow(wstr.data(), wstr.data() + wstr.size(), '?', buf.data());
return string(buf.data(), buf.size());
}
int main(int argc, char* argv[]){
std::string ansi;
char utf8txt[] = {0xc3, 0xa1, 0};
// I guess you want to use Windows-1252 encoding...
ansi = utf8_to_string(utf8txt, locale(".1252"));
// Now do something with the string
return 0;
}

Related

Get temp path with file name

I want to get path to file like this > %ENV%/%FILE_NAME%.docx
But c++ doesn't make sense at all and nothing works..
I would use std::string but it's not compatible so I tried multiple ways of converting it to char[] or char* but none of them works and I'm also pretty sure this is unsafe..
My code so far (I know it's the worst code ever..)
char* appendCharToCharArray(char* array, char a)
{
size_t len = strlen(array);
char* ret = new char[len + 2];
strcpy(ret, array);
ret[len] = a;
ret[len + 1] = '\0';
return ret;
}
const char* getBaseName(std::string path)
{
std::string base_filename = path.substr(path.find_last_of("/\\") + 1);
std::string::size_type const p(base_filename.find_last_of('.'));
std::string file_without_extension = base_filename.substr(0, p);
return file_without_extension.c_str();
}
int main()
{
char szExeFileName[MAX_PATH];
GetModuleFileName(NULL, szExeFileName, MAX_PATH);
const char* file_name = getBaseName(std::string(szExeFileName));
char* new_file = getenv("temp");
new_file = appendCharToCharArray(new_file, '\\');
for (int i=0;i<sizeof(file_name)/sizeof(file_name[0]);i++)
{
new_file = appendCharToCharArray(new_file, file_name[i]);
}
new_file = appendCharToCharArray(new_file, '.');
new_file = appendCharToCharArray(new_file, 'd');
new_file = appendCharToCharArray(new_file, 'o');
new_file = appendCharToCharArray(new_file, 'c');
new_file = appendCharToCharArray(new_file, 'x');
std::cout << new_file << std::endl;
}

Using appendCharToCharArray() is just horribly inefficient in general, and also you are leaking lots of memory with the way you are using it. Just use std::string instead. And yes, you can use std::string in this code, it is perfectly "compatible" if you use it correctly.
getBaseName() is returning a char* pointer to the data of a local std::string variable that goes out of scope when the function exits, thus a dangling pointer is returned. Again, use std::string instead.
And, you should use the Win32 GetTempPath/2() function instead of getenv("temp").
Try something more like this:
#include <iostream>
#include <string>
std::string getBaseName(const std::string &path)
{
std::string base_filename = path.substr(path.find_last_of("/\\") + 1);
std::string::size_type const p(base_filename.find_last_of('.'));
std::string file_without_extension = base_filename.substr(0, p);
return file_without_extension;
}
int main()
{
char szExeFileName[MAX_PATH] = {};
GetModuleFileNameA(NULL, szExeFileName, MAX_PATH);
char szTempFolder[MAX_PATH] = {};
GetTempPathA(MAX_PATH, szTempFolder);
std::string new_file = std::string(szTempFolder) + getBaseName(szExeFileName) + ".docx";
std::cout << new_file << std::endl;
}
Online Demo
That being said, the Win32 Shell API has functions for manipulating path strings, eg:
#include <iostream>
#include <string>
#include <windows.h>
#include <shlwapi.h>
#pragma comment(lib, "Shlwapi.lib")
int main()
{
char szExeFileName[MAX_PATH] = {};
GetModuleFileNameA(NULL, szExeFileName, MAX_PATH);
char szTempFolder[MAX_PATH] = {};
GetTempPathA(MAX_PATH, szTempFolder);
char new_file[MAX_PATH] = {};
PathCombineA(new_file, szTempFolder, PathFindFileNameA(szExeFileName));
PathRenameExtensionA(new_file, ".docx");
std::cout << new_file << std::endl;
}
Or, if you are using C++17 or later, consider using std::filesystem::path instead, eg:
#include <iostream>
#include <filesystem>
#include <windows.h>
namespace fs = std::filesystem;
int main()
{
char szExeFileName[MAX_PATH] = {};
GetModuleFileNameA(NULL, szExeFileName, MAX_PATH);
char szTempFolder[MAX_PATH] = {};
GetTempPathA(MAX_PATH, szTempFolder);
fs::path new_file = fs::path(szTempFolder) / fs::path(szExeFileName).stem();
new_file += ".docx";
// alternatively:
// fs::path new_file = fs::path(szTempFolder) / fs::path(szExeFileName).filename();
// new_file.replace_extension(".docx");
std::cout << new_file << std::endl;
}
Online Demo

std::length_error: basic_string converting wstring to string [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 4 years ago.
Improve this question
I have a task to convert a tool written in windows to run in Linux, so I have to write a function convert string to wstring or convert wstring to string in Linux.
Due to I am not familiar with c++,when I call Linux API, I have to change string to char* like function IsFileExist below.
If I remove function setlocale, error message below:
libc++abi.dylib: terminating with uncaught exception of type std::length_error: basic_string
Question: Is this correct to use setlocale? Actually after google still confused about this.
Here is the code:
#include <iostream>
#include <vector>
#include <sys/stat.h>
#include <unistd.h>
#include <string>
#include <fstream>
/*
string converts to wstring
*/
std::wstring s2ws(const std::string& src)
{
std::wstring res = L"";
size_t const wcs_len = mbstowcs(NULL, src.c_str(), 0);
std::vector<wchar_t> buffer(wcs_len + 1);
mbstowcs(&buffer[0], src.c_str(), src.size());
res.assign(buffer.begin(), buffer.end() - 1);
return res;
}
/*
wstring converts to string
*/
std::string ws2s(std::wstring const & src)
{
setlocale(LC_CTYPE, "");
std::string res = "";
size_t const mbs_len = wcstombs(NULL, src.c_str(), 0);
std::vector<char> buffer(mbs_len + 1);
wcstombs(&buffer[0], src.c_str(), buffer.size());
res.assign(buffer.begin(), buffer.end() - 1);
return res;
}
int IsFileExist(const std::wstring& name ) {
struct stat buffer;
/*convert wstring to string,then to C style char* */
std::string str = ws2s(name.c_str());
char *cstr = new char[str.length() + 1];
strncpy(cstr, str.c_str(),str.size());
/*judge if file exist*/
if(0 == stat(cstr,&buffer))
{
delete [] cstr;
return 1;
}
else
{
delete [] cstr;
return 0;
}
}
int main()
{
std::wstring str=L"chines中文œ∑®";
std::string res = ws2s(str);
std::cout<<res<<std::endl;
char dst[]="abcdef";
std::wstring fun = s2ws(dst);
std::wcout<<fun<<std::endl;
std::wstring file=L"/Users/coder52/Downloads/mac.zip";
std::cout << IsFileExist(file) << std::endl;
return 0;
}

You should not do it platform dependant. Please make it C++17 style and use std::filesystem::path for the path name and check the file existance with std::filesystem::exists. std::filesystem::path is able to handle both char* and wchar_t*.

How to convert unicode code points to utf-8 in c++?

I have an array consisting of unicode code points
unsigned short array[3]={0x20ac,0x20ab,0x20ac};
I just want this to be converted as utf-8 to write into file byte by byte using C++.
Example:
0x20ac should be converted to e2 82 ac.
or is there any other method that can directly write unicode characters in file.

Finally! With C++11!
#include <string>
#include <locale>
#include <codecvt>
#include <cassert>
int main()
{
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
std::string u8str = converter.to_bytes(0x20ac);
assert(u8str == "\xe2\x82\xac");
}

The term Unicode refers to a standard for encoding and handling of text. This incorporates encodings like UTF-8, UTF-16, UTF-32, UCS-2, ...
I guess you are programming in a Windows environment, where Unicode typically refers to UTF-16.
When working with Unicode in C++, I would recommend the ICU library.
If you are programming on Windows, don't want to use an external library, and have no constraints regarding platform dependencies, you can use WideCharToMultiByte.
Example for ICU:
#include <iostream>
#include <unicode\ustream.h>
using icu::UnicodeString;
int main(int, char**) {
//
// Convert from UTF-16 to UTF-8
//
std::wstring utf16 = L"foobar";
UnicodeString str(utf16.c_str());
std::string utf8;
str.toUTF8String(utf8);
std::cout << utf8 << std::endl;
}
To do exactly what you want:
// Assuming you have ICU\include in your include path
// and ICU\lib(64) in your library path.
#include <iostream>
#include <fstream>
#include <unicode\ustream.h>
#pragma comment(lib, "icuio.lib")
#pragma comment(lib, "icuuc.lib")
void writeUtf16ToUtf8File(char const* fileName, wchar_t const* arr, size_t arrSize) {
UnicodeString str(arr, arrSize);
std::string utf8;
str.toUTF8String(utf8);
std::ofstream out(fileName, std::ofstream::binary);
out << utf8;
out.close();
}

Following code may help you,
#include <atlconv.h>
#include <atlstr.h>
#define ASSERT ATLASSERT
int main()
{
const CStringW unicode1 = L"\x0391 and \x03A9"; // 'Alpha' and 'Omega'
const CStringA utf8 = CW2A(unicode1, CP_UTF8);
ASSERT(utf8.GetLength() > unicode1.GetLength());
const CStringW unicode2 = CA2W(utf8, CP_UTF8);
ASSERT(unicode1 == unicode2);
}

This code uses WideCharToMultiByte (I assume that you are using Windows):
unsigned short wide_str[3] = {0x20ac, 0x20ab, 0x20ac};
int utf8_size = WideCharToMultiByte(CP_UTF8, 0, wide_str, 3, NULL, 0, NULL, NULL) + 1;
char* utf8_str = calloc(utf8_size);
WideCharToMultiByte(CP_UTF8, 0, wide_str, 3, utf8_str, utf8_size, NULL, NULL);
You need to call it twice: first time to get number of output bytes, and second time to actually convert it. If you know output buffer size, you may skip first call. Or, you can simply allocate buffer 2x larger than original + 1 byte (for your case it means 12+1 bytes) - it should be always enough.

With std c++
#include <iostream>
#include <locale>
#include <vector>
int main()
{
typedef std::codecvt<wchar_t, char, mbstate_t> Convert;
std::wstring w = L"\u20ac\u20ab\u20ac";
std::locale locale("en_GB.utf8");
const Convert& convert = std::use_facet<Convert>(locale);
std::mbstate_t state;
const wchar_t* from_ptr;
char* to_ptr;
std::vector<char> result(3 * w.size() + 1, 0);
Convert::result convert_result = convert.out(state,
w.c_str(), w.c_str() + w.size(), from_ptr,
result.data(), result.data() + result.size(), to_ptr);
if (convert_result == Convert::ok)
std::cout << result.data() << std::endl;
else std::cout << "Failure: " << convert_result << std::endl;
}

Iconv is a popular library used on many platforms.

I had a similar but slightly different problem. I had strings with the Unicode code point in it as a string representation. Ex: "F\u00f3\u00f3 B\u00e1r". I needed to convert the string code points to their Unicode character.
Here is my C# solution
using System.Globalization;
using System.Text.RegularExpressions;
static void Main(string[] args)
{
Regex CodePoint = new Regex(#"\\u(?<UTF32>....)");
Match Letter;
string s = "F\u00f3\u00f3 B\u00e1r";
string utf32;
Letter = CodePoint.Match(s);
while (Letter.Success)
{
utf32 = Letter.Groups[1].Value;
if (Int32.TryParse(utf32, NumberStyles.HexNumber, CultureInfo.GetCultureInfoByIetfLanguageTag("en-US"), out int HexNum))
s = s.Replace("\\u" + utf32, Char.ConvertFromUtf32(HexNum));
Letter = Letter.NextMatch();
}
Console.WriteLine(s);
}
Output: Fóó Bár

How do I convert a string to a wstring using the value of the string?

I'm new to C++ and I have this issue. I have a string called DATA_DIR that I need for format into a wstring.
string str = DATA_DIR;
std::wstring temp(L"%s",str);
Visual Studio tells me that there is no instance of constructor that matches with the argument list. Clearly, I'm doing something wrong.
I found this example online
std::wstring someText( L"hello world!" );
which apparently works (no compile errors). My question is, how do I get the string value stored in DATA_DIR into the wstring constructor as opposed to something arbitrary like "hello world"?

Here is an implementation using wcstombs (Updated):
#include <iostream>
#include <cstdlib>
#include <string>
std::string wstring_from_bytes(std::wstring const& wstr)
{
std::size_t size = sizeof(wstr.c_str());
char *str = new char[size];
std::string temp;
std::wcstombs(str, wstr.c_str(), size);
temp = str;
delete[] str;
return temp;
}
int main()
{
std::wstring wstr = L"abcd";
std::string str = wstring_from_bytes(wstr);
}
Here is a demo.

This is in reference to the most up-voted answer but I don't have enough "reputation" to just comment directly on the answer.
The name of the function in the solution "wstring_from_bytes" implies it is doing what the original poster wants, which is to get a wstring given a string, but the function is actually doing the opposite of what the original poster asked for and would more accurately be named "bytes_from_wstring".
To convert from string to wstring, the wstring_from_bytes function should use mbstowcs not wcstombs
#define _CRT_SECURE_NO_WARNINGS
#include <iostream>
#include <cstdlib>
#include <string>
std::wstring wstring_from_bytes(std::string const& str)
{
size_t requiredSize = 0;
std::wstring answer;
wchar_t *pWTempString = NULL;
/*
* Call the conversion function without the output buffer to get the required size
* - Add one to leave room for the NULL terminator
*/
requiredSize = mbstowcs(NULL, str.c_str(), 0) + 1;
/* Allocate the output string (Add one to leave room for the NULL terminator) */
pWTempString = (wchar_t *)malloc( requiredSize * sizeof( wchar_t ));
if (pWTempString == NULL)
{
printf("Memory allocation failure.\n");
}
else
{
// Call the conversion function with the output buffer
size_t size = mbstowcs( pWTempString, str.c_str(), requiredSize);
if (size == (size_t) (-1))
{
printf("Couldn't convert string\n");
}
else
{
answer = pWTempString;
}
}
if (pWTempString != NULL)
{
delete[] pWTempString;
}
return answer;
}
int main()
{
std::string str = "abcd";
std::wstring wstr = wstring_from_bytes(str);
}
Regardless, this is much more easily done in newer versions of the standard library (C++ 11 and newer)
#include <locale>
#include <codecvt>
#include <string>
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::wstring wide = converter.from_bytes(narrow_utf8_source_string);

printf-style format specifiers are not part of the C++ library and cannot be used to construct a string.
If the string may only contain single-byte characters, then the range constructor is sufficient.
std::string narrower( "hello" );
std::wstring wider( narrower.begin(), narrower.end() );
The problem is that we usually use wstring when wide characters are applicable (hence the w), which are represented in std::string by multibyte sequences. Doing this will cause each byte of a multibyte sequence to translate to an sequence of incorrect wide characters.
Moreover, to convert a multibyte sequence requires knowing its encoding. This information is not encapsulated by std::string nor std::wstring. C++11 allows you to specify an encoding and translate using std::wstring_convert, but I'm not sure how widely supported it is of yet. See 0x....'s excellent answer.

The converter mentioned for C++11 and above has deprecated this specific conversion in C++17, and suggests using the MultiByteToWideChar function.
The compiler error (c4996) mentions defining _SILENCE_CXX17_CODECVT_HEADER_DEPRECATION_WARNING.

wstring temp = L"";
for (auto c : DATA_DIR)
temp.push_back(c);

I found this function. Could not find any predefined method to do this.
std::wstring s2ws(const std::string& s)
{
int len;
int slength = (int)s.length() + 1;
len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
wchar_t* buf = new wchar_t[len];
MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
std::wstring r(buf);
delete[] buf;
return r;
}
std::wstring stemp = s2ws(myString);

Convert WCHAR[260] to std::string

I have gotten a WCHAR[MAX_PATH] from (PROCESSENTRY32) pe32.szExeFile on Windows. The following do not work:
std::string s;
s = pe32.szExeFile; // compile error. cast (const char*) doesnt work either
and
std::string s;
char DefChar = ' ';
WideCharToMultiByte(CP_ACP,0,pe32.szExeFile,-1, ch,260,&DefChar, NULL);
s = pe32.szExeFile;

For your first example you can just do:
std::wstring s(pe32.szExeFile);
and for second:
char DefChar = ' ';
WideCharToMultiByte(CP_ACP,0,pe32.szExeFile,-1, ch,260,&DefChar, NULL);
std::wstring s(pe32.szExeFile);
as std::wstring has a char* ctor

Your call to WideCharToMultiByte looks correct, provided ch is a
sufficiently large buffer. After than, however, you want to assign the
buffer (ch) to the string (or use it to construct a string), not
pe32.szExeFile.

There are convenient conversion classes from ATL; you may want to use some of them, e.g.:
std::string s( CW2A(pe32.szExeFile) );
Note however that a conversion from Unicode UTF-16 to ANSI can be lossy. If you wan't a non-lossy conversion, you could convert from UTF-16 to UTF-8, and store UTF-8 inside std::string.
If you don't want to use ATL, there are some convenient freely available C++ wrappers around raw Win32 WideCharToMultiByte to convert from UTF-16 to UTF-8 using STL strings.

#ifndef __STRINGCAST_H__
#define __STRINGCAST_H__
#include <vector>
#include <string>
#include <cstring>
#include <cwchar>
#include <cassert>
template<typename Td>
Td string_cast(const wchar_t* pSource, unsigned int codePage = CP_ACP);
#endif // __STRINGCAST_H__
template<>
std::string string_cast( const wchar_t* pSource, unsigned int codePage )
{
assert(pSource != 0);
size_t sourceLength = std::wcslen(pSource);
if(sourceLength > 0)
{
int length = ::WideCharToMultiByte(codePage, 0, pSource, sourceLength, NULL, 0, NULL, NULL);
if(length == 0)
return std::string();
std::vector<char> buffer( length );
::WideCharToMultiByte(codePage, 0, pSource, sourceLength, &buffer[0], length, NULL, NULL);
return std::string(buffer.begin(), buffer.end());
}
else
return std::string();
}
and use this template as followed
PWSTR CurWorkDir;
std::string CurWorkLogFile;
CurWorkDir = new WCHAR[length];
CurWorkLogFile = string_cast<std::string>(CurWorkDir);
....
delete [] CurWorkDir;

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to convert from UTF-8 to ANSI using standard c++ - c++

If you mean ASCII, just discard any byte that has bit 7 set, this will remove all multibyte sequences. Note that you could create more advanced algorithms, like removing the accent from the "á", but that would require much more work.

Related

Get temp path with file name

std::length_error: basic_string converting wstring to string [closed]

How to convert unicode code points to utf-8 in c++?

How do I convert a string to a wstring using the value of the string?

Convert WCHAR[260] to std::string

Categories

Resources