I'm reading a UTF-8 encoded Unicode text file and outputting it to the console, but the displayed characters are not the same as in the text editor I used to create the file. Here is my code:
#define UNICODE
#include <windows.h>
#include <iostream>
#include <fstream>
#include <string>
#include "pugixml.hpp"
using std::ifstream;
using std::ios;
using std::string;
using std::wstring;
int main( int argc, char * argv[] )
{
ifstream oFile;
try
{
string sContent;
oFile.open ( "../config-sample.xml", ios::in );
if( oFile.is_open() )
{
wchar_t wsBuffer[128];
while( oFile.good() )
{
oFile >> sContent;
mbstowcs( wsBuffer, sContent.c_str(), sizeof( wsBuffer ) );
//wprintf( wsBuffer );// Same result as wcout.
wcout << wsBuffer;
}
Sleep(100000);
}
else
{
throw L"Failed to open file";
}
}
catch( const wchar_t * pwsMsg )
{
::MessageBox( NULL, pwsMsg, L"Error", MB_OK | MB_TOPMOST | MB_SETFOREGROUND );
}
if( oFile.is_open() )
{
oFile.close();
}
return 0;
}
There must be something I don't get about encoding.
The problem is that mbstowcs doesn't actually use UTF-8. It uses an older style of "multibyte" encoding chosen by the current locale, which is not compatible with UTF-8 (although technically it is possible [I believe] to define a UTF-8 codepage, Windows has traditionally not offered one for the C locale).
If you want to convert UTF-8 to UTF-16, you can use MultiByteToWideChar, with a codepage of CP_UTF8.
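To make the conversion itself less mysterious, here is a minimal portable sketch of what a UTF-8 to UTF-16 conversion does. The name utf8_to_utf16 is hypothetical and this is not how MultiByteToWideChar is implemented; it is simplified and does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a Windows API): decodes UTF-8 bytes into UTF-16
// code units. Simplified sketch: overlong forms and encoded surrogates are
// not rejected.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::runtime_error("code point out of range");

        if (cp >= 0x10000) {
            // Outside the Basic Multilingual Plane: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

In a real Windows program you would normally let MultiByteToWideChar with CP_UTF8 do this work; the sketch only shows why a locale-dependent function like mbstowcs gives different results.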
Wide strings don't mean UTF-8. In fact, it's quite the opposite: UTF-8 means Unicode Transformation Format (8 bits); it's a way to represent Unicode over 8-bit characters, so your normal chars. You should read it into normal strings (not wide strings).
Wide strings use wchar_t, which on Windows is 16 bits. The OS uses UTF-16 for its "wide" functions.
On Windows, UTF-8 strings can be converted to UTF-16 using MultiByteToWideChar.
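One consequence of 16-bit wide characters is that a single character can occupy two code units (a surrogate pair). A small sketch, assuming well-formed UTF-16 input; count_code_points is a hypothetical name, not a library function:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a well-formed UTF-16
// string by skipping the second half of each surrogate pair.
std::size_t count_code_points(const std::u16string& s)
{
    std::size_t n = 0;
    for (char16_t u : s) {
        // Low (trailing) surrogates 0xDC00..0xDFFF continue a pair.
        if (u < 0xDC00 || u > 0xDFFF)
            ++n;
    }
    return n;
}
```

So size() on a wide string on Windows gives the number of 16-bit units, which is not always the number of characters.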
I made a C++ char_t container that holds up to six 8-bit char_t values, stored in a std::vector. It can be converted to and from wchar_t or appended to a std::string.
Check it out here:
View UTF-8_String structures on Github
#include "UTF-8_String.h" //header from github link above
iBS::u8str raw_v;
iBS::readu8file("TestUTF-8File.txt",raw_v);
std::cout<<raw_v.str()<<std::endl;
Here is a function that converts a wchar_t to a uint32_t in the u8char struct found in the header above.
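#include <climits>
#include <cwchar>
u8char& operator=(wchar_t& wc)
{
char temp[MB_LEN_MAX];
std::mbstate_t state = std::mbstate_t(); // start from the initial conversion state
std::size_t ret = std::wcrtomb(temp, wc, &state);
if (ret == static_cast<std::size_t>(-1))
ret = 0; // wc has no representation in the current locale
ref.resize(ret);
for (std::size_t i = 0; i < ret; ++i)
ref[i] = temp[i];
return *this;
};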
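Note that wcrtomb only produces UTF-8 if the current C locale uses UTF-8. For comparison, here is a locale-independent sketch of encoding one code point as UTF-8 bytes. encode_utf8 is a hypothetical name and is not taken from the header above; it does not validate the input (surrogates and values above 0x10FFFF are not rejected).

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 without
// consulting the locale. No validation is performed.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                         // 1 byte
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));           // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));          // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));          // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```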
I find wifstream works very well; even the Visual Studio debugger shows UTF-8 words correctly (I'm reading Traditional Chinese words). From this post:
#include <sstream>
#include <fstream>
#include <codecvt>
std::wstring readFile(const char* filename)
{
std::wifstream wif(filename);
wif.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t>));
std::wstringstream wss;
wss << wif.rdbuf();
return wss.str();
}
// usage
std::wstring wstr2;
wstr2 = readFile("C:\\yourUtf8File.txt");
wcout << wstr2;
I need your help.
I'm using Windows 10 and the Visual Studio Community compiler.
I managed to get Lithuanian letters to show in a C++ console application using wstring and wcout.
#include <iostream>
#include <io.h>
#include <fcntl.h>
using namespace std;
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
wstring a = L"ąėėąčėį";
wcout << a;
return 0;
}
Result is exactly what I wanted it to be
Now I want my program to read Lithuanian letters from Info.txt file.
This is how far I managed to get.
#include <iostream>
#include <fstream>
#include <io.h>
#include <fcntl.h>
#include <string>
using namespace std;
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
wstring text;
wifstream fin("Info.txt");
getline(fin, text);
wcout << text;
return 0;
}
But the returned string in the console application shows different symbols.
I believe a possible solution is to add L before the text, like in the previous example with wcout:
wstring a = L"ąėėąčėį";
But I'm still just learning C++ and I don't know how to do that in the example with Info.txt.
I need your help!
A UTF-8 file should be read with std::ifstream, not wifstream. The latter is used on Windows for UTF-16 file storage (not recommended on any system).
You can use SetConsoleOutputCP(CP_UTF8) to enable UTF-8 printing, but that can run into problems, especially in C++20.
Instead, call _setmode and convert UTF-8 to UTF-16.
Make sure Notepad saves the file as UTF-8 (the encoding option is available in the Save window).
#include <iostream>
#include <fstream>
#include <string>
#include <io.h>
#include <fcntl.h>
#include <Windows.h>
std::wstring u16(const std::string& u8)
{
if (u8.empty()) return std::wstring();
// Pass the explicit length so the result has no embedded terminating null.
const int len = static_cast<int>(u8.size());
int size = MultiByteToWideChar(CP_UTF8, 0, u8.data(), len, nullptr, 0);
std::wstring u16(size, 0);
MultiByteToWideChar(CP_UTF8, 0, u8.data(), len, &u16[0], size);
return u16;
}
int main()
{
(void)_setmode(_fileno(stdout), _O_U16TEXT);
std::string text;
std::ifstream fin("Info.txt");
if (fin)
while (getline(fin, text))
std::wcout << u16(text) << "\n";
return 0;
}
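One more thing to watch for: some editors (Notepad among them, depending on the version and the encoding option chosen) write a byte order mark at the start of a UTF-8 file. If it is there, it ends up at the front of the first line you read. A minimal sketch for dropping it; strip_utf8_bom is a hypothetical helper, not part of any library:

```cpp
#include <string>

// Hypothetical helper: removes a leading UTF-8 byte order mark
// (bytes EF BB BF) if the string starts with one.
std::string strip_utf8_bom(const std::string& line)
{
    if (line.compare(0, 3, "\xEF\xBB\xBF") == 0)
        return line.substr(3);
    return line;
}
```

You would call it on the first line only, before converting that line to UTF-16.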
Sorry for what may sound simple, but I am trying to draw just a simple box in Visual Studio 2017 using the unicode characters from https://en.wikipedia.org/wiki/Box-drawing_character using the code below
#include <iostream>
using namespace std;
int main()
{
cout << "┏━━━━━━━━━━━━━━━━━┓" << endl;
cout << "┃" << endl;
and so on...
However, whenever I run it all of the above code simply outputs as a ? wherever there should be a line.
So is it possible to output code like this directly to the console or for each character do I have to write the numeric values for each character?
Windows console supports UTF-16LE UNICODE.
You can use a box-drawing library such as PDCurses, for example.
Otherwise you can use the following approach
#include <windows.h>
#include <cwchar>
class output_swap {
output_swap(const output_swap&) = delete;
output_swap& operator=(const output_swap&) = delete;
public:
output_swap( ) noexcept:
prevInCP_( ::GetConsoleCP() ),
prevOutCP_( ::GetConsoleOutputCP() )
{
::SetConsoleCP( CP_WINUNICODE );
::SetConsoleOutputCP( CP_WINUNICODE );
}
~output_swap() noexcept {
// Restore the input and output code pages separately.
::SetConsoleCP( prevInCP_ );
::SetConsoleOutputCP( prevOutCP_ );
}
private:
::DWORD prevInCP_;
::DWORD prevOutCP_;
};
void draw_text(const wchar_t* text)
{
static ::HANDLE _out = ::GetStdHandle(STD_OUTPUT_HANDLE);
::DWORD written;
::WriteConsoleW( _out, text, std::wcslen(text), &written, nullptr );
}
int main(int argc, const char** argv) {
output_swap swap;
draw_text(L"┏━━━━━━━━━━━━━━━━━┓\n");
draw_text(L"┃ OK ┃\n");
draw_text(L"┗━━━━━━━━━━━━━━━━━┛\n");
return 0;
}
Also check your console font in the console settings. You may need to change it; this works with Consolas, for example.
If you need console I/O streams that work with Unicode as well as box drawing, you can use my library.
Windows console apps can output wide strings (L"...") directly to the terminal if the mode is set correctly. Note the use of wcout as well. Save the following source in UTF-8 encoding:
#include <iostream>
#include <io.h>
#include <fcntl.h>
using namespace std;
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
wcout << L"┏━━━━━━━━━━━━━━━━━┓" << endl;
wcout << L"┃" << endl;
}
Compile with "cl /EHsc /utf-8 test.cpp". Output is:
┏━━━━━━━━━━━━━━━━━┓
┃
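On the "do I have to write the numeric values" part of the question: you don't have to, but you can, and it sidesteps source-file encoding problems. A small sketch using universal character names; box_top and box_bottom are hypothetical helper names:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers: build box borders from Unicode escapes, so the
// result does not depend on how the source file is saved.
// U+250F, U+2501, U+2513, U+2517 and U+251B are the heavy box-drawing
// corner and horizontal-line characters.
std::wstring box_top(std::size_t width)
{
    return L'\u250F' + std::wstring(width, L'\u2501') + L'\u2513';
}

std::wstring box_bottom(std::size_t width)
{
    return L'\u2517' + std::wstring(width, L'\u2501') + L'\u251B';
}
```

The resulting strings can be sent to wcout after the _setmode call shown above.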
#include <iostream>
#include <Windows.h>
#include <locale>
#include <string>
#include <codecvt>
typedef wchar_t* LPWSTR, *PWSTR;
template <typename Facet>
struct deletable_facet : Facet
{
using Facet::Facet;
};
int main(int argc, char *argv[])
{
std::cout << argv[0] << std::endl;
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
//std::wcout << converter.from_bytes(argv[0]) << std::endl; // range error
std::wstring_convert<deletable_facet<std::codecvt<wchar_t, char, std::mbstate_t>>> conv;
std::wstring ns = conv.from_bytes(argv[0]);
std::wcout << ns << std::endl;
wchar_t filename[MAX_PATH];
//GetModuleFileName(NULL,filename,MAX_PATH); // cant convert wstring_t* to char*
GetModuleFileNameW(NULL,filename,MAX_PATH);
std::wcout << filename << std::endl;
getchar();
return 0;
}
Output:
C:\Users\luka\Desktop\ⁿ?icΣ\unicode.exe
C:\Users\luka\Desktop\ⁿ?icΣ\unicode.exe
C:\Users\luka\Desktop\ⁿ
Actual name of the folder is üлicä
I've been trying many different ways for about 2 hours now, and as far as I've seen people suggest GetModuleFileName, but as you can see that returns a conversion error (typedef wchar_t* LPWSTR, *PWSTR; isn't fixing it).
So is there any way to get the current folder path in Unicode, and get the rest of the input arguments in Unicode (non-Latin characters)?
The usage for GetModuleFileName is correct. You should see the expected result with MessageBoxW(0, filename, 0, 0);
The problem is in printing L"üлicä" on Windows console.
Try printing "üлicä" on the console:
int main(int argc, char *argv[])
{
DWORD count;
std::wstring str = GetCommandLineW() + (std::wstring)L"\n";
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), str.c_str(), str.size(), &count, 0);
MessageBoxW(0, str.c_str(), 0, 0);
wchar_t filename[MAX_PATH];
GetModuleFileNameW(0, filename, MAX_PATH);
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), filename, wcslen(filename), &count, 0);
return 0;
}
In Visual Studio you can also use _setmode to enable usage of std::wcout/std::wcin
You also have optional entry point wmain(int argc, wchar_t *argv[]) which provides argv in UTF16 encoding.
The main entry point provides argv in ANSI encoding (not UTF-8 encoding). ANSI can lose information, unlike Unicode.
This probably is related not to the program but the console, I suggest you try to output into a file and check if the encoding is correct.
You can do that using freopen:
int main(int argc, char *argv[]){
freopen("output-file-name.txt", "w", stdout);
/*rest of code*/
}
If problem persists, try using visual studio along with _setmode(..., _O_U16TEXT) just before using wcout as described here: https://stackoverflow.com/a/9051543/9541897
Here's an example that works with Windows. You'll have to find the right compiler/linker settings to support wmain on MinGW, but it will work. _setmode enables writing Unicode directly to the terminal, and should work as long as the font supports the characters. In my example I use some Chinese, which my font supports:
#include <Windows.h>
#include <iostream>
#include "fcntl.h"
#include "io.h"
int wmain(int argc, wchar_t* argv[])
{
_setmode(_fileno(stdout), _O_U16TEXT);
std::wcout << argv[0] << std::endl;
wchar_t filename[MAX_PATH];
GetModuleFileNameW(NULL,filename,MAX_PATH);
std::wcout << filename << std::endl;
return 0;
}
Output:
马克.exe
C:\üлicä\马克.exe
Why are you typedefing LPWSTR and PWSTR manually? windows.h already handles that for you.
In any case, as @n.m. said in comments, the arguments for main() are NOT encoded in UTF-8 on Windows, so converting non-ASCII characters using a UTF8->UTF16 converter will not produce the correct output. Use the Win32 MultiByteToWideChar() function instead to convert the arguments, using CP_ACP as the codepage to convert from. Or, use wmain() instead, which provides arguments as wchar_t* instead of as char*.
That will get you the data you want. Then, you just need to deal with the issue of Unicode output to the console. As other answers point out, the Windows console does not support UTF-16 output via std::wcout by default, so you have to jump through some additional hoops to make it work correctly (there are many other questions on StackOverflow about that issue).
In MSVC++, if you create a new Visual Studio console application (x64 platform, running on Windows 8.1, x64), and set it to a Unicode character set with the following code in main:
int _tmain(int argc, _TCHAR* argv[])
{
stringstream stream;
stream << _T("Testing Unicode. English - Ελληνικά - Español.") << std::endl;
string str = stream.str();
std::wcout << str.c_str();
cin.get();
}
It outputs this:
00007FF616443E50
I would like it to output this instead:
Testing Unicode. English - Ελληνικά - Español.
How can this be achieved?
Edit: With wstringstream and wstring instead:
wstringstream stream; stream << _T("Testing Unicode. English - Ελληνικά - Español.") << std::endl;
wstring str = stream.str();
std::wcout << str.c_str();
The output is truncated:
Testing Unicode. English -
Setting the mode like so: _setmode(_fileno(stdout), _O_U16TEXT);
The output is still undesirable because not all characters get rendered properly:
Testing Unicode. English - ???????? - Español.
Setting the output CP like so: SetConsoleOutputCP(CP_UTF8);
The output is again truncated:
Testing Unicode. English -
Using the following alone doesn't work. You must also right-click the Visual Studio console that pops up, click Default Properties, click the Fonts tab and set the font to Lucida Console. Then the code below will run just fine. Without the overloads of the << operator for Windows, it will NOT work. You may also want to add an overload for char or wchar_t, or simply make this a template overload.
If you do not like the overloads, you may use _setmode(_fileno(stdout), _O_U16TEXT); or _setmode(_fileno(stdout), _O_U8TEXT); for UTF-16 and UTF-8 respectively.
// Unicode.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <sstream>
#include <iostream>
#if defined _WIN32 || defined _WIN64
#include <Windows.h>
#else
#include <io.h>
#include <fcntl.h>
#endif
#if defined _WIN32 || defined _WIN64
std::ostream& operator << (std::ostream& os, const char* data)
{
SetConsoleOutputCP(CP_UTF8);
DWORD slen = strlen(data);
WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), data, slen, &slen, nullptr);
return os;
}
std::ostream& operator << (std::ostream& os, const std::string& data)
{
SetConsoleOutputCP(CP_UTF8);
WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), data.c_str(), data.size(), nullptr, nullptr);
return os;
}
std::wostream& operator <<(std::wostream& os, const wchar_t* data)
{
DWORD slen = wcslen(data);
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), data, slen, &slen, nullptr);
return os;
}
std::wostream& operator <<(std::wostream& os, const std::wstring& data)
{
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), data.c_str(), data.size(), nullptr, nullptr);
return os;
}
#endif
int _tmain(int argc, _TCHAR* argv[])
{
std::wstringstream stream;
stream << _T("Testing Unicode. English - Ελληνικά - Español.") << std::endl;
#if !defined _WIN32 && !defined _WIN64
_setmode(_fileno(stdout), _O_U16TEXT);
#endif
std::wstring str = stream.str();
std::wcout << str;
std::wcin.get();
return 0;
}
On Windows there is ONE more thing that can help render fonts in ANY language. I did not find this posted anywhere else on the net. I navigated to Control Panel\Appearance and Personalization\Fonts, clicked Font Settings, unchecked "Hide fonts based on language settings" and saved the options. This will allow you to write Japanese and Chinese characters as well as Arabic and whatever other languages you want. It seems to work with the default console fonts as well. I had to restart for it to take effect, though. Not sure if it actually works for anyone else.
I have a wide-character string (std::wstring) in my code, and I need to search for a wide character in it.
I use find() function for it:
wcin >> str;
wcout << ((str.find(L'ф') != wstring::npos)? L"EXIST":L"NONE");
L'ф' is a Cyrillic letter.
But find() in this call always returns npos. With Latin letters find() works correctly.
Is this a problem with the function?
Or am I doing something incorrectly?
UPD
I use MinGW and save source in UTF-8.
I also set locale with setlocale(LC_ALL, "");.
Code such as wcout << L'ф'; works correctly.
But code such as
wchar_t w;
wcin >> w;
wcout << w;
works incorrectly.
It is strange. Earlier I had no problems with the encoding, using setlocale ().
The encoding of your source file and the execution environment's encoding may be wildly different. C++ makes no guarantees about any of this. You can check this by outputting the hexadecimal value of your string literal:
for (wchar_t c : std::wstring(L"ф")) std::wcout << std::hex << static_cast<unsigned long>(c) << L' ';
Before C++11, you could use non-ASCII characters in source code by using their hex values:
"\x05" "five"
C++11 adds the ability to specify their Unicode value, which in your case (Cyrillic ф, U+0444) would be
L"\u0444"
If you're going full C++11 (and your environment ensures these are encoded in UTF-*), you can use any of char, char16_t, or char32_t, and do:
const char* ef_utf8 = "\u0444";
const char16_t* ef_utf16 = u"\u0444";
const char32_t* ef_utf32 = U"\u0444";
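If you want to see which code units actually ended up in a string, a hex dump avoids any dependence on how the console renders text. A minimal sketch; hex_units is a hypothetical helper name:

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats each UTF-16 code unit as four uppercase
// hex digits, separated by single spaces.
std::string hex_units(const std::u16string& s)
{
    std::ostringstream os;
    os << std::hex << std::uppercase << std::setfill('0');
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i != 0)
            os << ' ';
        // setw applies to the next insertion only, so set it every time.
        os << std::setw(4) << static_cast<unsigned>(s[i]);
    }
    return os.str();
}
```

The result is plain ASCII, so it can go to std::cout. If a literal written directly in the source does not dump as 0444 for ф, the compiler is probably reading the source file in a different encoding than the one it was saved in.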
You must set the encoding of the console.
This works:
#include <iostream>
#include <string>
#include <io.h>
#include <fcntl.h>
#include <stdio.h>
using namespace std;
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
_setmode(_fileno(stdin), _O_U16TEXT);
wstring str;
wcin >> str;
wcout << ((str.find(L'ф') != wstring::npos)? L"EXIST":L"NONE");
system("pause");
return 0;
}
std::wstring::find() works fine. But you have to read the input string correctly.
The following code runs fine on Windows console (the input Unicode string is read using ReadConsoleW() Win32 API):
#include <exception>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <windows.h>
using namespace std;
class Win32Error : public runtime_error
{
public:
Win32Error(const char* message, DWORD error)
: runtime_error(message)
, m_error(error)
{}
DWORD Error() const
{
return m_error;
}
private:
DWORD m_error;
};
void ThrowLastWin32(const char* message)
{
const DWORD error = GetLastError();
throw Win32Error(message, error);
}
void Test()
{
const HANDLE hStdIn = GetStdHandle(STD_INPUT_HANDLE);
if (hStdIn == INVALID_HANDLE_VALUE)
ThrowLastWin32("GetStdHandle failed.");
static const int kBufferLen = 200;
wchar_t buffer[kBufferLen];
DWORD numRead = 0;
if (! ReadConsoleW(hStdIn, buffer, kBufferLen, &numRead, nullptr))
ThrowLastWin32("ReadConsoleW failed.");
const wstring str(buffer, numRead - 2);
static const wchar_t kEf = 0x0444;
wcout << ((str.find(kEf) != wstring::npos) ? L"EXIST" : L"NONE");
}
int main()
{
static const int kExitOk = 0;
static const int kExitError = 1;
try
{
Test();
return kExitOk;
}
catch(const Win32Error& e)
{
cerr << "\n*** ERROR: " << e.what() << '\n';
cerr << " (GetLastError returned " << e.Error() << ")\n";
return kExitError;
}
catch(const exception& e)
{
cerr << "\n*** ERROR: " << e.what() << '\n';
return kExitError;
}
}
Output:
C:\TEMP>test.exe
abc
NONE
C:\TEMP>test.exe
abcфabc
EXIST
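To separate the search from the input problem, you can test find() on strings whose contents you control. A small sketch; contains_ef is a hypothetical name. The third case in the note below imitates a mis-decoded input where a legacy code page byte (0xE4, which is ф in code page 866) was widened unchanged instead of being converted to U+0444.

```cpp
#include <string>

// Hypothetical helper: true if the string contains Cyrillic small letter ef
// (U+0444). Written with an escape so source-file encoding cannot matter.
bool contains_ef(const std::wstring& s)
{
    return s.find(L'\u0444') != std::wstring::npos;
}
```

contains_ef(L"abc\u0444abc") is true, contains_ef(L"abc") is false, and contains_ef(L"abc\u00E4") is also false, which is the same symptom as in the question when the input was not decoded correctly.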
That's probably an encoding issue. wcin works with an encoding different from your compiler's/source code's. Try entering the ф in the console/wcin -- it will work. Try printing the ф via wcout -- it will show a different character or no character at all.
There is no platform-independent way to circumvent this, but if you are on Windows, you can manually change the console encoding, either with the chcp command-line command or programmatically with SetConsoleCP() (input) and SetConsoleOutputCP() (output).
You could also change your source file's/compiler's encoding. How this is done depends on your editor/compiler. If you are using MSVC, this answer might help you: https://stackoverflow.com/a/1660901/2128694