C++ Read UTF-8 (Lithuanian letters) symbols from txt file and show them in console application [duplicate] - c++

I need your help.
I'm using Windows 10 and Visual Studio Community compiler.
I managed to get Lithuanian letters to show in a C++ console application using wstring and wcout.
#include <iostream>
#include <io.h>
#include <fcntl.h>
using namespace std;
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
wstring a = L"ąėėąčėį";
wcout << a;
return 0;
}
Result is exactly what I wanted it to be
Now I want my program to read Lithuanian letters from Info.txt file.
This is how far I managed to get.
#include <iostream>
#include <fstream>
#include <io.h>
#include <fcntl.h>
#include <string>
using namespace std;
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
wstring text;
wifstream fin("Info.txt");
getline(fin, text);
wcout << text;
return 0;
}
But the string shown in the console application contains different symbols.
I believe a possible solution is to add L before the text, as in the previous example with wcout:
wstring a = L"ąėėąčėį";
But I'm still just learning C++ and I don't know how to do that in the example with Info.txt.
I need your help!

UTF-8 needs std::ifstream, not std::wifstream. On Windows the latter is used for files stored as UTF-16 (which is not recommended on any system).
You can use SetConsoleOutputCP(CP_UTF8) to enable UTF-8 printing, but that can run into problems, especially in C++20.
Instead, call _setmode and convert UTF-8 to UTF-16.
Make sure Notepad saves the file as UTF-8 (the encoding option is available in the Save As dialog).
#include <iostream>
#include <fstream>
#include <string>
#include <io.h>
#include <fcntl.h>
#include <Windows.h>
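std::wstring u16(const std::string& u8)
{
if (u8.empty()) return std::wstring();
// Pass the explicit byte count (not -1) so the result has no embedded terminating null.
const int len = static_cast<int>(u8.size());
int size = MultiByteToWideChar(CP_UTF8, 0, u8.data(), len, nullptr, 0);
std::wstring result(size, 0);
MultiByteToWideChar(CP_UTF8, 0, u8.data(), len, &result[0], size);
return result;
}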
int main()
{
(void)_setmode(_fileno(stdout), _O_U16TEXT);
std::string text;
std::ifstream fin("Info.txt");
if (fin)
while (getline(fin, text))
std::wcout << u16(text) << L"\n";
return 0;
}
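To see what the UTF-8 to UTF-16 conversion actually does, here is a rough portable sketch of the same step without the Windows API. The name utf8_to_utf16 is made up for this illustration, and the decoder is deliberately simplified: malformed input becomes U+FFFD and overlong encodings are not rejected, so MultiByteToWideChar remains the better choice on Windows.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of the answer above): decodes UTF-8 bytes
// into UTF-16 code units. Malformed or truncated sequences become U+FFFD;
// overlong encodings are NOT rejected in this simplified sketch.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;   // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;      // truncated or bad continuation byte
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.push_back(0xFFFD);
            continue;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP are split into a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Each Lithuanian letter such as ą takes two bytes in UTF-8 but a single 16-bit unit in UTF-16, which is why the bytes read from Info.txt cannot be handed to wcout unchanged.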

Related

Converting Turkish-I letter to lowercase using boost in CPP

For the past few days I have been trying to write C++ code that correctly converts the Turkish I to lowercase ı with VS2022 on Windows.
As I understand it, the Turkish I has the same Unicode code point as the regular Latin I, so I need to set the locale to Turkish before converting. I used the following code:
#include <clocale>
#include <cwctype>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>
int main() {
std::wstring input_str = L"I";
std::setlocale(LC_ALL, "tr_TR.UTF-8"); // This should impact std::towlower
std::locale loc("tr_TR.UTF-8");
std::wofstream output_file("lowercase_turkish.txt");
output_file.imbue(loc);
for (wchar_t& c : input_str) {
c = std::towlower(c);
}
output_file << input_str << std::endl;
output_file.close();
}
It worked fine on Linux, outputting ı, but it did not work correctly on Windows, where it output i instead of ı.
After some research I think it is a bug in the Windows Unicode/ASCII mapping, so I tried an alternative solution using an external library, Boost. Here is my code:
#include <boost/algorithm/string.hpp>
#include <string>
#include <locale>
#include <iostream>
#include <fstream>
using namespace std;
using namespace boost::algorithm;
int main()
{
std::string s = "I";
std::locale::global(std::locale{ "Turkish" });
to_lower(s);
ofstream outfile("output.txt");
outfile << s << endl;
outfile.close();
return 0;
}
Again, it outputs i instead of ı. Using to_lower_copy gives the same result.
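One possible workaround that does not depend on how the platform implements the Turkish locale is to special-case the two Turkish pairs (I/ı and İ/i) and leave everything else to std::towlower. This is only a sketch under the assumption that the text is known to be Turkish; the name turkish_to_lower is made up here and is not part of Boost or the standard library.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: lowercases a wide string using Turkish casing rules
// for the two special pairs, and std::towlower for everything else.
// Assumes the caller already knows the text is Turkish.
std::wstring turkish_to_lower(std::wstring s)
{
    for (wchar_t& c : s) {
        if (c == L'I') {
            c = L'\u0131';   // Latin capital I -> dotless small ı
        } else if (c == L'\u0130') {
            c = L'i';        // capital İ with dot above -> plain small i
        } else {
            c = static_cast<wchar_t>(
                std::towlower(static_cast<std::wint_t>(c)));
        }
    }
    return s;
}
```

Note that how std::towlower treats non-ASCII characters still depends on the current C locale; only the two special cases are handled independently of it.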

std::wcout printing unicode characters but they are hidden

So, the following code:
#include <iostream>
#include <string>
#include <io.h>
#include <fcntl.h>
#include <codecvt>
int main()
{
setlocale(LC_ALL, "");
std::wstring a;
std::wcout << L"Type a string: " << std::endl;
std::getline(std::wcin, a);
std::wcout << a << std::endl;
getchar();
}
When I type "åäö" I get some weird output. The terminal's cursor is indented, but there is no text behind it. If I use my right arrow key to move the cursor forward the "åäö" reveal themselves as I click the right arrow key.
If I include English letters so that the input is "helloåäö" the output is "hello" but as I click my right arrow key "helloåäö" appears letter by letter.
Why does this happen and more importantly how can I fix it?
Edit: I compile with Visual Studio's compiler on Windows. When I tried this exact code in repl.it (they use clang) it works like a charm. Is the problem caused by my code, Windows or Visual Studio?
Windows requires some OS-specific calls to set up the console for Unicode:
#include <iostream>
#include <string>
#include <io.h>
#include <fcntl.h>
// From fcntl.h:
// #define _O_U16TEXT 0x20000 // file mode is UTF16 no BOM (translated)
// #define _O_WTEXT 0x10000 // file mode is UTF16 (translated)
int main()
{
_setmode(_fileno(stdout), _O_WTEXT); // or _O_U16TEXT, either works
_setmode(_fileno(stdin), _O_WTEXT);
std::wstring a;
std::wcout << L"Type a string: ";
std::getline(std::wcin, a);
std::wcout << a << std::endl;
getwchar();
}
Output:
Type a string: helloåäö马克
helloåäö马克

How to output unicode box drawing in C++?

Sorry for what may sound simple, but I am trying to draw just a simple box in Visual Studio 2017 using the unicode characters from https://en.wikipedia.org/wiki/Box-drawing_character using the code below
#include <iostream>
using namespace std;
int main()
{
cout << "┏━━━━━━━━━━━━━━━━━┓" << endl;
cout << "┃" << endl;
and so on...
However, whenever I run it all of the above code simply outputs as a ? wherever there should be a line.
So is it possible to output code like this directly to the console or for each character do I have to write the numeric values for each character?
The Windows console supports UTF-16LE Unicode.
You can use a library that handles box drawing, such as PDCurses.
Otherwise, you can use the following approach:
#include <windows.h>
#include <cwchar>
class output_swap {
output_swap(const output_swap&) = delete;
output_swap& operator=(const output_swap&) = delete;
public:
output_swap( ) noexcept:
prevInCP_( ::GetConsoleCP() ),
prevOutCP_( ::GetConsoleOutputCP() )
{
::SetConsoleCP( CP_WINUNICODE );
::SetConsoleOutputCP( CP_WINUNICODE );
}
~output_swap() noexcept {
// Restore the input and output code pages separately; they can differ.
::SetConsoleCP( prevInCP_ );
::SetConsoleOutputCP( prevOutCP_ );
}
private:
::UINT prevInCP_;
::UINT prevOutCP_;
};
void draw_text(const wchar_t* text)
{
static ::HANDLE _out = ::GetStdHandle(STD_OUTPUT_HANDLE);
::DWORD written;
::WriteConsoleW( _out, text, std::wcslen(text), &written, nullptr );
}
int main(int argc, const char** argv) {
output_swap swap;
draw_text(L"┏━━━━━━━━━━━━━━━━━┓\n");
draw_text(L"┃ OK ┃\n");
draw_text(L"┗━━━━━━━━━━━━━━━━━┛\n");
return 0;
}
Also check your console font in the console settings. You probably need a raster font, but this also works with Consolas, for example.
If you need console I/O streams that can handle Unicode as well as box drawing, you can use my library.
Windows console apps can output wide strings (L"...") directly to the terminal if the mode is set correctly. Note the use of wcout as well. Save the following source in UTF-8 encoding:
#include <iostream>
#include <io.h>
#include <fcntl.h>
using namespace std;
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
wcout << L"┏━━━━━━━━━━━━━━━━━┓" << endl;
wcout << L"┃" << endl;
}
Compile with "cl /EHsc /utf-8 test.cpp". Output is:
┏━━━━━━━━━━━━━━━━━┓
┃
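On the question of writing numeric values for each character: that is optional, but it does make the source independent of the file's encoding. As a rough sketch, the box can be built from universal character names (\uXXXX). The helper name make_box is made up for this illustration, and it assumes every character of the label occupies one console column.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: builds a three-line box around `label` using
// universal character names instead of literal box-drawing glyphs.
// Assumes each wchar_t in `label` is one console column wide.
std::vector<std::wstring> make_box(const std::wstring& label)
{
    // U+2501 is the heavy horizontal line; one per label column plus padding.
    const std::wstring bar(label.size() + 2, L'\u2501');
    return {
        L"\u250F" + bar + L"\u2513",           // top:    corner, bar, corner
        L"\u2503 " + label + L" \u2503",       // middle: side, label, side
        L"\u2517" + bar + L"\u251B"            // bottom: corner, bar, corner
    };
}
```

Each returned line can then be written with wcout after the _setmode call shown above, for example in a loop over make_box(L"OK").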

Saving a file in a different place for linux

I am trying to save a file somewhere other than the folder of the exe. I have pieced together this inelegant way:
#include <stdio.h>
#include <iostream>
#include <fstream>
#include <string>
#include <unistd.h>
using namespace std;
int main() {
//getting current path of the executable
char executable_path[256];
getcwd(executable_path, 255);
//unelegant back and forth conversion to add a different location
string file_loction_as_string;
file_loction_as_string = executable_path;
file_loction_as_string += "/output_files/hello_world.txt"; //folder has to exist
char *file_loction_as_char = const_cast<char*>(file_loction_as_string.c_str());
// creating, writing, closing file
ofstream output_file(file_loction_as_char);
output_file << "hello world!";
output_file.close();
}
Is there a more elegant way to do this? So that the char-string-char* is not necessary.
Also is it possible to create the output folder in the process apart from mkdir?
Thank you
You can get rid of 3 lines of code if you use the following.
int main()
{
// getcwd returns the current working directory (not necessarily the executable's folder)
char executable_path[256];
getcwd(executable_path, 255);
// append the relative location directly; no char* round trip is needed
string file_loction_as_string = string(executable_path) + "/output_files/hello_world.txt";
// creating, writing, closing file
ofstream output_file(file_loction_as_string.c_str());
output_file << "hello world!";
output_file.close();
}
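If C++17 is available, std::filesystem can handle both the path joining and the folder creation, so neither the char buffer nor mkdir is needed. The following is a minimal sketch; the helper name write_output and its parameters are made up for this illustration.

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: writes `text` to <base>/output_files/<name>,
// creating the folder first if it does not exist. Returns the full path.
fs::path write_output(const fs::path& base,
                      const std::string& name,
                      const std::string& text)
{
    const fs::path dir = base / "output_files";
    fs::create_directories(dir);     // does nothing if the folder already exists
    const fs::path file = dir / name;
    std::ofstream out(file);         // std::ofstream accepts fs::path directly
    out << text;
    return file;                     // `out` is flushed and closed on return
}
```

Calling write_output(fs::current_path(), "hello_world.txt", "hello world!") mirrors the original program. Like getcwd, fs::current_path() returns the working directory, which is not necessarily the folder containing the executable.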

C++ / wcout / UTF-8

I'm reading a UTF-8 encoded Unicode text file and outputting it to the console, but the displayed characters are not the same as in the text editor I used to create the file. Here is my code:
#define UNICODE
#include <windows.h>
#include <iostream>
#include <fstream>
#include <string>
#include "pugixml.hpp"
using std::ifstream;
using std::ios;
using std::string;
using std::wstring;
int main( int argc, char * argv[] )
{
ifstream oFile;
try
{
string sContent;
oFile.open ( "../config-sample.xml", ios::in );
if( oFile.is_open() )
{
wchar_t wsBuffer[128];
while( oFile.good() )
{
oFile >> sContent;
mbstowcs( wsBuffer, sContent.c_str(), sizeof( wsBuffer ) );
//wprintf( wsBuffer );// Same result as wcout.
wcout << wsBuffer;
}
Sleep(100000);
}
else
{
throw L"Failed to open file";
}
}
catch( const wchar_t * pwsMsg )
{
::MessageBox( NULL, pwsMsg, L"Error", MB_OK | MB_TOPMOST | MB_SETFOREGROUND );
}
if( oFile.is_open() )
{
oFile.close();
}
return 0;
}
There must be something I don't get about encoding.
The problem is that mbstowcs doesn't actually use UTF-8. It uses an older style of "multibyte codepoints", which is not compatible with UTF-8 (although it is technically possible, I believe, to define a UTF-8 codepage, there is no such thing in Windows).
If you want to convert UTF-8 to UTF-16, you can use MultiByteToWideChar, with a codepage of CP_UTF8.
Wide strings don't mean UTF-8. In fact, it's quite the opposite: UTF-8 means Unicode Transformation Format (8 bits); it's a way to represent Unicode over 8-bit characters, so your normal chars. You should read it into normal strings (not wide strings).
Wide strings use wchar_t, which on Windows is 16 bits. The OS uses UTF-16 for its "wide" functions.
On Windows, UTF-8 strings can be converted to UTF-16 using MultiByteToWideChar.
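As a rough illustration of reading the file into a normal string first, the sketch below loads the raw bytes and drops a leading byte order mark, which some Windows editors write at the start of UTF-8 files. The helper name read_utf8_file is made up for this example; the returned bytes would then be passed to MultiByteToWideChar as described above.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the raw UTF-8 bytes of a file, minus a
// leading byte order mark (EF BB BF) if one is present.
// Returns an empty string if the file cannot be opened.
std::string read_utf8_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);  // binary: no newline translation
    std::ostringstream buf;
    buf << in.rdbuf();
    std::string bytes = buf.str();
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0)
        bytes.erase(0, 3);                     // strip the BOM
    return bytes;
}
```

Reading the whole file this way also avoids the word-by-word loop with operator>>, which discards the whitespace between words.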
I made a C++ char_t container that holds up to six 8-bit char_t values, storing them in a std::vector. It can be converted to and from wchar_t or appended to a std::string.
Check it out here:
View UTF-8_String structures on Github
#include "UTF-8_String.h" //header from github link above
iBS::u8str raw_v;
iBS::readu8file("TestUTF-8File.txt",raw_v);
std::cout<<raw_v.str()<<std::endl;
Here is the function that converts a wchar_t to a uint32_t in the u8char struct found in the header above.
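#include <climits>
#include <cwchar>
u8char& operator=(wchar_t& wc)
{
char temp[MB_LEN_MAX]; // large enough for any multibyte character
std::mbstate_t state{}; // must start in the initial conversion state
std::size_t ret = std::wcrtomb(temp, wc, &state);
if (ret == static_cast<std::size_t>(-1)) ret = 0; // wc is not representable in the current locale
ref.resize(ret);
for (std::size_t i = 0; i < ret; ++i)
ref[i] = temp[i];
return *this;
}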
I find that wifstream works very well; even the Visual Studio debugger shows the UTF-8 words correctly (I'm reading Traditional Chinese words). This is from this post:
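#include <sstream>
#include <fstream>
#include <locale>
#include <string>
#include <codecvt> // deprecated since C++17
std::wstring readFile(const char* filename)
{
std::wifstream wif(filename);
// std::locale::empty() is an MSVC extension; wif.getloc() is a portable base locale
wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf8<wchar_t>));
std::wstringstream wss;
wss << wif.rdbuf();
return wss.str();
}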
// usage (needs <iostream> for std::wcout)
std::wstring wstr2;
wstr2 = readFile("C:\\yourUtf8File.txt");
std::wcout << wstr2;