Example:
#include <iostream>
using namespace std;
int main()
{
    wchar_t en[] = L"Hello";
    wchar_t ru[] = L"Привет"; // Russian text
    cout << ru << endl << en;
    return 0;
}
This code only prints hex values that look like addresses.
How do I print the wchar_t strings?
Edit: This doesn’t work if you are trying to write text that cannot be represented in your default locale. :-(
Use std::wcout instead of std::cout.
wcout << ru << endl << en;
Can I suggest std::wcout?
So, something like this:
std::cout << "ASCII and ANSI" << std::endl;
std::wcout << L"INSERT MULTIBYTE WCHAR* HERE" << std::endl;
You might find more information in a related question here.
You cannot portably print wide strings using standard C++ facilities.
Instead you can use the open-source {fmt} library to portably print Unicode text. For example (https://godbolt.org/z/nccb6j):
#include <fmt/core.h>
int main() {
    const char en[] = "Hello";
    const char ru[] = "Привет";
    fmt::print("{}\n{}\n", ru, en);
}
prints
Привет
Hello
This requires compiling with the /utf-8 compiler option in MSVC.
For comparison, writing to wcout on Linux:
wchar_t en[] = L"Hello";
wchar_t ru[] = L"Привет";
std::wcout << ru << std::endl << en;
may transliterate the Russian text into Latin (https://godbolt.org/z/za5zP8):
Privet
Hello
This particular issue can be fixed by switching to a locale that uses UTF-8, but a similar problem exists on Windows that cannot be fixed with standard facilities alone.
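For example, a minimal sketch of that locale fix on Linux (assuming the en_US.UTF-8 locale is installed on the system; std::locale throws otherwise):
#include <iostream>
#include <locale>
int main() {
    // Assumption: en_US.UTF-8 is installed on this system.
    std::locale::global(std::locale("en_US.UTF-8"));
    std::wcout.imbue(std::locale());
    std::wcout << L"Привет" << std::endl; // prints Привет instead of Privet
}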
Disclaimer: I'm the author of {fmt}.
The Windows documentation around this is very confusing; it helps to learn the C/C++ concepts on Unix/Linux before programming on Windows.
On Windows, wchar_t stores a character in UTF-16, as fixed 16-bit code units called wide characters, but wprintf() or wcout will never print non-English wide characters correctly, because no console accepts output in UTF-16. Windows outputs in the current locale's code page, while Unix/Linux outputs in UTF-8; both are multi-byte encodings. So you have to convert wide characters to multi-byte before printing. The standard C function wcstombs() doesn't do the job on Windows; use WideCharToMultiByte() instead.
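For comparison, a minimal wcstombs() sketch as it might look on Unix/Linux (assuming the environment locale is UTF-8); this is the part that doesn't carry over to the Windows console:
#include <clocale>
#include <cstdio>
#include <cstdlib>
int main() {
    std::setlocale(LC_ALL, "");            // adopt the environment's (UTF-8) locale
    const wchar_t* ru = L"Привет";
    char buf[64];
    std::size_t n = std::wcstombs(buf, ru, sizeof buf - 1); // wide -> multi-byte
    if (n != (std::size_t)-1) {
        buf[n] = '\0';
        std::printf("%s\n", buf);          // prints Привет in a UTF-8 terminal
    }
}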
First you need to convert the source file to UTF-8 using Notepad or another editor. Then install a font in the command-prompt console that can display your language, and switch the console code page to UTF-8 by typing "chcp 65001" at the prompt (Cygwin already defaults to UTF-8). Here is what I did for Thai.
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
    const wchar_t* in = L"ทดสอบ"; // Thai text
    char* out = (char*)malloc(15);
    WideCharToMultiByte(874, 0, in, -1, out, 15, NULL, NULL);
    printf("%s", out); // prints the Thai text correctly, although not neatly
    free(out);
}
Note that 874 is the Thai code page in the operating system and 15 is the size of the output buffer in bytes; passing -1 as the source length converts the whole null-terminated string.
My suggestion is to avoid printing non-English wide characters to the console unless necessary, because it is not easy.
#include <clocale>
#include <iostream>
using namespace std;
int main()
{
    setlocale(LC_ALL, "Russian");
    cout << "\tДОБРО ПОЖАЛОВАТЬ В КИНО!\n"; // "WELCOME TO THE CINEMA!"
}
The way to do it is to convert UTF-16 LE (Default Windows encoding) into UTF-8, and then print to console (chcp 65001 first, to switch codepage to UTF-8).
It's pretty trivial to convert UTF-16 to UTF-8. Use this page as a guide if you need more than 2-byte characters.
// cmd points to the null-terminated UTF-16 LE input string
short* cmd_s = (short*)cmd;
int i = 0;
while (cmd_s[i] != 0)
{
    short u16 = cmd_s[i++];
    if (u16 > 0x7F)
    {
        unsigned char c0 = ((char)u16 & 0x3F) | 0x80;        // least significant 6 bits
        unsigned char c1 = char(((u16 >> 6) & 0x1F) | 0xC0); // most significant 5 bits
        cout << c1 << c0; // leading byte first
    }
    else
    {
        unsigned char c0 = (char)u16;
        cout << c0;
    }
}
Of course, you can put this in a function and extend it to handle wider characters (for Cyrillic, two bytes are enough), but I wanted to show the basic algorithm and to prove that it's not hard at all: you don't need any libraries, just a few lines of code.
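For instance, a sketch of that extension to the 3-byte form, covering all code points up to U+FFFF (surrogate pairs are still not handled):
#include <iostream>
// One UTF-16 code unit -> UTF-8, now with the 1-, 2- and 3-byte forms.
void putUtf8(unsigned short u16) {
    if (u16 < 0x80) {
        std::cout << (unsigned char)u16;                          // ASCII, 1 byte
    } else if (u16 < 0x800) {
        std::cout << (unsigned char)(((u16 >> 6) & 0x1F) | 0xC0)  // leading byte
                  << (unsigned char)((u16 & 0x3F) | 0x80);        // continuation byte
    } else {
        std::cout << (unsigned char)(((u16 >> 12) & 0x0F) | 0xE0) // leading byte
                  << (unsigned char)(((u16 >> 6) & 0x3F) | 0x80)  // continuation byte
                  << (unsigned char)((u16 & 0x3F) | 0x80);        // continuation byte
    }
}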
You could use a normal char array that is actually filled with UTF-8 bytes. This allows mixing characters across languages.
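A minimal sketch of that approach; the escapes spell out the UTF-8 bytes directly, so it works regardless of the source file's code page (the console must still expect UTF-8, e.g. after chcp 65001):
#include <iostream>
int main() {
    // "Привет" written out as raw UTF-8 bytes in a plain char array
    const char ru[] = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";
    std::cout << ru << " Hello\n"; // Russian and English mixed in one stream
}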
You can print wide characters with wprintf.
#include <clocale>
#include <cwchar>
int main()
{
    setlocale(LC_ALL, ""); // use the environment's locale so wide output converts
    wchar_t en[] = L"Hello";
    wchar_t ru[] = L"Привет"; // Russian text
    wprintf(L"%ls\n", en);
    wprintf(L"%ls\n", ru);
    return 0;
}
Output:
Hello
Привет
So I wanted to try converting Unicode to an integer for a project of mine. I tried something like this :
unsigned int foo = (unsigned int)L'آ';
std::cout << foo << std::endl;
How do I convert it back? Or, in other words, how do I convert an int to the corresponding Unicode character?
EDIT: I am expecting the output to be the Unicode character for a given integer, for example:
cout << (wchar_t) 1570; // This should print the Unicode character for 1570 (which is: آ)
I am using Visual Studio 2013 Community with its default compiler, on Windows 10 64-bit Pro.
Cheers
L'آ' will work okay as a single wide character, because its code point is below 0xFFFF. But in general UTF-16 includes surrogate pairs, so a Unicode code point cannot always be represented with a single wide character. You need a wide string instead.
Your problem also partly has to do with printing a UTF-16 character in the Windows console. If you use MessageBoxW to view a wide string, it will work as expected:
wchar_t buf[2] = { 0 };
buf[0] = 1570;
MessageBoxW(0, buf, 0, 0);
However, in general you need a wide string to account for surrogate pairs, not a single wide char. Example:
int utf32 = 1570;
const int mask = (1 << 10) - 1;
std::wstring str;
if (utf32 < 0x10000) // code points above 0xFFFF need a surrogate pair
{
    str.push_back((wchar_t)utf32);
}
else
{
    utf32 -= 0x10000;
    int hi = (utf32 >> 10) & mask;
    int lo = utf32 & mask;
    hi += 0xD800;
    lo += 0xDC00;
    str.push_back((wchar_t)hi);
    str.push_back((wchar_t)lo);
}
MessageBoxW(0, str.c_str(), 0, 0);
See related posts for printing UTF16 in Windows console.
The key here is the call setlocale(LC_ALL, "en_US.UTF-8"). en_US is the locale name, which you may want to set to a different value, like zh_CN for Chinese, for example.
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
int main() {
    setlocale(LC_ALL, "en_US.UTF-8");
    // This does not work without setlocale(LC_ALL, "en_US.UTF-8");
    for (int ch = 30000; ch < 30030; ch++) {
        wprintf(L"%lc", ch);
    }
    wprintf(L"\n"); // keep stdout wide-oriented; don't mix printf and wprintf
    return 0;
}
Things to notice here are the use of wprintf and how the format string is given: L"%lc", which tells wprintf to treat the string and the character as wide. If you want to use this method to print variables, use the type wchar_t.
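For example, a hedged sketch printing a wchar_t variable and a wide string this way (the locale name is the same assumption as above):
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
int main() {
    setlocale(LC_ALL, "en_US.UTF-8");
    wchar_t ch = L'\u4e2d';          // a single wide character (中)
    wchar_t str[] = L"\u4e2d\u6587"; // a wide string (中文)
    wprintf(L"%lc %ls\n", ch, str);  // %lc for the character, %ls for the string
    return 0;
}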
Useful links:
setlocale
wprintf
I'm asking for a code snippet which reads (cin) a Unicode text, concatenates another Unicode string to the first, and then writes (cout) the result.
P.S. This code will help me solve another, bigger problem with Unicode. But first the key thing is to accomplish what I ask.
ADDED: BTW, I can't type any Unicode symbol on the command line when I run the executable file. How should I do that?
I had a similar problem in the past, in my case imbue and sync_with_stdio did the trick. Try this:
#include <iostream>
#include <locale>
#include <string>
using namespace std;
int main() {
    ios_base::sync_with_stdio(false);
    wcin.imbue(locale("en_US.UTF-8"));
    wcout.imbue(locale("en_US.UTF-8"));
    wstring s;
    wstring t(L" la Polynésie française");
    wcin >> s;
    wcout << s << t << endl;
    return 0;
}
It depends on what kind of Unicode you mean. I assume you are just working with std::wstring, though. In that case, use std::wcin and std::wcout.
For conversion between encodings you can use your OS's functions, e.g. on Win32: WideCharToMultiByte and MultiByteToWideChar, or you can use a library like libiconv.
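For illustration, a hedged sketch of a UTF-16/UTF-8 round trip with those Win32 functions (the helper names toUtf8/toUtf16 are mine, and error handling is omitted):
#include <windows.h>
#include <string>
std::string toUtf8(const std::wstring& w) {
    // First call asks for the required size (in bytes, including the null).
    int n = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, NULL, 0, NULL, NULL);
    std::string s(n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, w.c_str(), -1, &s[0], n, NULL, NULL);
    s.resize(n - 1); // drop the trailing null the API wrote
    return s;
}
std::wstring toUtf16(const std::string& s) {
    int n = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, NULL, 0);
    std::wstring w(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, &w[0], n);
    w.resize(n - 1);
    return w;
}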
Here is an example that shows four different methods, of which only the third (C conio) and the fourth (native Windows API) work (and only if stdin/stdout aren't redirected). Note that you still need a font that contains the character you want to show (Lucida Console supports at least Greek and Cyrillic), and that everything here is completely non-portable: there is just no portable way to input/output Unicode strings on the terminal.
#ifndef UNICODE
#define UNICODE
#endif
#ifndef _UNICODE
#define _UNICODE
#endif
#define STRICT
#define NOMINMAX
#define WIN32_LEAN_AND_MEAN
#include <iostream>
#include <string>
#include <cstdlib>
#include <cstdio>
#include <conio.h>
#include <windows.h>
void testIostream();
void testStdio();
void testConio();
void testWindows();
int wmain() {
    testIostream();
    testStdio();
    testConio();
    testWindows();
    std::system("pause");
}
void testIostream() {
    std::wstring first, second;
    std::getline(std::wcin, first);
    if (!std::wcin.good()) return;
    std::getline(std::wcin, second);
    if (!std::wcin.good()) return;
    std::wcout << first << second << std::endl;
}
void testStdio() {
    wchar_t buffer[0x1000];
    if (!_getws_s(buffer)) return;
    const std::wstring first = buffer;
    if (!_getws_s(buffer)) return;
    const std::wstring second = buffer;
    const std::wstring result = first + second;
    _putws(result.c_str());
}
void testConio() {
    wchar_t buffer[0x1000];
    std::size_t numRead = 0;
    if (_cgetws_s(buffer, &numRead)) return;
    const std::wstring first(buffer, numRead);
    if (_cgetws_s(buffer, &numRead)) return;
    const std::wstring second(buffer, numRead);
    const std::wstring result = first + second + L'\n';
    _cputws(result.c_str());
}
void testWindows() {
    const HANDLE stdIn = GetStdHandle(STD_INPUT_HANDLE);
    WCHAR buffer[0x1000];
    DWORD numRead = 0;
    // ReadConsoleW takes a length in characters, not bytes
    if (!ReadConsoleW(stdIn, buffer, sizeof buffer / sizeof *buffer, &numRead, NULL)) return;
    const std::wstring first(buffer, numRead - 2); // strip the trailing CR LF
    if (!ReadConsoleW(stdIn, buffer, sizeof buffer / sizeof *buffer, &numRead, NULL)) return;
    const std::wstring second(buffer, numRead);
    const std::wstring result = first + second;
    const HANDLE stdOut = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD numWritten = 0;
    WriteConsoleW(stdOut, result.c_str(), result.size(), &numWritten, NULL);
}
Edit 1: I've added a method based on conio.
Edit 2: I've messed around with _O_U16TEXT a bit as described in Michael Kaplan's blog, but that seemingly only had wgets interpret the (8-bit) data from ReadFile as UTF-16. I'll investigate this a bit further during the weekend.
If you have actual text (i.e., a string of logical characters), then insert it into the wide streams instead. The wide streams will automatically encode your characters to match the bytes expected by the locale encoding. (And if you have encoded bytes instead, the streams will decode the bytes, then re-encode them to match the locale.)
There is a lesser solution: if you KNOW you have UTF-encoded bytes (i.e., an array of bytes intended to be decoded into a string of logical characters) AND you KNOW the target of the output stream expects that very same byte format, then you can skip the decoding and re-encoding steps and write() the bytes as-is. This only works when you know both sides use the same encoding format, which may be the case for small utilities not intended to communicate with processes in other locales.
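A minimal sketch of that second approach, assuming both the string and the stream's target use UTF-8:
#include <iostream>
#include <string>
int main() {
    // Bytes already encoded as UTF-8 ("Привет"); skip decoding/re-encoding
    // and write them out untouched.
    std::string utf8 = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";
    std::cout.write(utf8.data(), utf8.size());
    std::cout.put('\n');
}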
It depends on the OS. If your OS understands UTF-8, you can simply send it UTF-8 sequences.
I had a problem opening files whose path contains UTF-8 characters (like Cyrillic or Latin with diacritics). I found a way to solve it with _wfopen, but only by encoding the UTF-8 characters by hand as \uxxxx escapes.
Is there a function, macro, or anything that, when I supply the string (path), will return the \uxxxx form?
Something like this:
https://www.branah.com/unicode-converter
I tried MultiByteToWideChar but it returns some hex numbers that are not relevant.
Tried:
std::wstring s2ws(const std::string& s)
{
    int len;
    int slength = (int)s.length() + 1;
    len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
    wchar_t* buf = new wchar_t[len];
    MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
    std::wstring r(buf);
    delete[] buf;
    return r;
}
std::wstring stemp = s2ws(x);
LPCWSTR result = stemp.c_str();
The result I get: 0055F7E8
Thank you in advance
Update:
I installed Boost, and now I am trying to do it with Boost. Can someone maybe help me out with Boost?
So I have a path:
wchar_t path[100] = _T("čaćšžđ\\test.txt");
I need it converted to:
wchar_t s[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");
Here's a way to convert between UTF-8 and UTF-16 on Windows, as well as showing the real values of the stored code units for both input and output:
#include <codecvt>
#include <iostream>
#include <iomanip>
#include <string>
int main() {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
    std::string s = "test";
    std::cout << std::hex << std::setfill('0');
    std::cout << "Input `char` data: ";
    for (char c : s) {
        std::cout << std::setw(2) << static_cast<unsigned>(static_cast<unsigned char>(c)) << ' ';
    }
    std::cout << '\n';
    std::wstring ws = convert.from_bytes(s);
    std::cout << "Output `wchar_t` data: ";
    for (wchar_t wc : ws) {
        std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
    }
    std::cout << '\n';
}
Understanding the real values of the input and output is important because otherwise you may not correctly understand the transformation that you really need. For example it looks to me like there may be some confusion as to how VC++ deals with encodings, and what \Uxxxxxxxx and \uxxxx actually do in C++ source code (e.g., they don't necessarily produce UTF-8 data).
Try using code like that shown above to see what your input data really is.
To emphasize what I've written above; there are strong indications that you may not correctly understand the processing that's being done on your input, and you need to thoroughly check it.
The above program does correctly transform the UTF-8 representation of ć (U+0107) into the single 16-bit code unit 0x0107, if you replace the test string with the following:
std::string s = "\xC4\x87"; // UTF-8 representation of U+0107
The output of the program, on Windows using Visual Studio, is then:
Input char data: c4 87
Output wchar_t data: 0107
This is in contrast to using test strings such as:
std::string s = "ć";
or:
std::string s = "\u0107";
which may result in the following output:
Input char data: 3f
Output wchar_t data: 003f
The problem here is that Visual Studio does not use UTF-8 as the encoding for strings without some trickery, so your request to convert from UTF-8 probably isn't what you actually need; or you do need conversion from UTF-8, but you're testing potential conversion routines using input that differs from your real input.
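If you do control the test input, one way to take the source encoding out of the equation (an assumption on my part; this applies pre-C++20, where u8 literals are still char-based) is:
// The u8 prefix guarantees UTF-8 storage regardless of the code page the
// source file happens to be saved in.
std::string s = u8"\u0107"; // exactly the two bytes 0xC4 0x87
This can be dropped into the test program above in place of the other test strings.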
So I have a path: wchar_t path[100] = _T("čaćšžđ\\test.txt");
I need it converted to:
wchar_t s[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");
Okay, so if I understand correctly, your actual problem is that the following fails:
wchar_t path[100] = _T("čaćšžđ\\test.txt");
FILE *f = _wfopen(path, L"w");
But if you instead write the string like:
wchar_t path[100] = _T("\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt");
Then the _wfopen call succeeds and opens the file you want.
First of all, this has absolutely nothing to do with UTF-8. I assume you found some workaround using a char string and converting that to wchar_t and you somehow interpreted this as involving UTF-8, or something.
What encoding are you saving the source code with? Is the string L"čaćšžđ\\test.txt" actually being saved properly? Try closing the source file and reopening it. If some characters show up replaced by ?, then part of your problem is the source file encoding. In particular this is true of the default encoding used by Windows in most of North America and Western Europe: "Western European (Windows) - Codepage 1252".
You can also check the output of the following program:
#include <iomanip>
#include <iostream>
int main() {
    wchar_t path[16] = L"čaćšžđ\\test.txt";
    std::cout << std::hex << std::setfill('0');
    for (wchar_t wc : path) {
        std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
    }
    std::cout << '\n';
    wchar_t s[16] = L"\u010d\u0061\u0107\u0161\u017e\u0111\\test.txt";
    for (wchar_t wc : s) {
        std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
    }
    std::cout << '\n';
}
Another thing you need to understand is that the \uxxxx form of writing characters, called Universal Character Names or UCNs, is not a form that you can convert strings to and from in C++. By the time you've compiled the program and it's running, i.e. by the time any code you write could be attempting to produce strings containing \uxxxx, the time when UCNs are interpreted by the compiler as different characters is long past. The only UCNs that will work are ones that are written directly in the source file.
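A small sketch to illustrate: the two literals below are byte-for-byte identical once compiled, so there is nothing left for runtime code to convert (assuming the source file is saved in an encoding where č survives):
#include <cassert>
#include <cwchar>
int main() {
    const wchar_t* a = L"\u010d"; // the character written as a UCN
    const wchar_t* b = L"č";      // the character written directly
    assert(std::wcscmp(a, b) == 0); // identical arrays; the compiler did the work
}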
Also, you're using _T() incorrectly. IMO you shouldn't be using TCHAR and the related macros at all, but if you do use them then you ought to use them consistently: don't mix TCHAR APIs with explicit use of the *W APIs or wchar_t. The whole point of TCHAR is to allow code to switch between the wchar_t APIs and Microsoft's "ANSI" APIs, so using TCHAR and then hard-coding the assumption that TCHAR is wchar_t defeats its entire purpose.
You should just write:
wchar_t path[100] = L"čaćšžđ\\test.txt";
Your code is Windows-specific, and you're using Visual C++. So just use wide literals: Visual C++ supports wide strings for file stream constructors.
It's as simple as that, when you don't require portability.
#include <fstream>
#include <iostream>
#include <stdlib.h>
using namespace std;
auto main() -> int
{
    wchar_t const path[] = L"cacšžd/test.txt";
    ifstream f( path );
    int ch;
    while( (ch = f.get()) != EOF )
    {
        cout.put( ch );
    }
}
Note, however, that this code is Visual C++ specific. That's reasonable for Windows-specific code. Possibly with C++17 we will have the Boost filesystem library adopted into the standard library, and then, for conformance, g++ will ideally offer the constructor used here too.
The problem was that I was saving the .cpp file as ANSI... I had to convert it to UTF-8. I had tried this before posting, but VS 2015 kept turning it back into ANSI; I had to change it inside VS to get it working.
I tried opening the .cpp file with Notepad++ and changing the encoding there, but when I reopened it in VS it automatically reverted. So I looked at the Save As dialog, but there is no encoding option there. Finally I found it, in Visual Studio 2015:
File -> Advanced Save Options, and in the Encoding dropdown change it to Unicode.
One thing that is still strange to me: how did VS display the characters normally when, opening the file in N++, there were ? characters (as was to be expected, because of ANSI)?
I have the lovely functions from my previous question, which work fine if I do this:
wstring temp;
wcin >> temp;
string whatever( toUTF8(getSomeWString()) );
// store whatever, copy, but do not use it as UTF8 (see below)
wcout << toUTF16(whatever) << endl;
The original form is reproduced, but the in-between form often contains extra characters. If I enter for example àçé as the input and add a cout << whatever statement, I'll get ┬à┬ç┬é as output.
Can I still use this string to compare to others procured from an ASCII source? Or, asked differently: if I output ┬à┬ç┬é through the UTF-8 cout on Linux, would it read àçé? Is the byte content of a string àçé, read by cin on UTF-8 Linux, exactly the same as what the Win32 API gets me?
Thanks!
PS: the reason I'm asking is because I need to use the string a lot to compare to other read values (comparing and concatenating...).
Let's start by me saying that it appears that there is simply no way to output UTF-8 text to the console in Windows via cout (assuming you compile with Visual Studio).
What you can do, however, for your tests is to output your UTF-8 text via the Win32 API function WriteConsoleA:
if (!SetConsoleOutputCP(CP_UTF8)) { // 65001
    cerr << "Failed to set console output mode!\n";
    return 1;
}
HANDLE const consout = GetStdHandle(STD_OUTPUT_HANDLE);
DWORD nNumberOfCharsWritten;
const char* utf8 = "Umlaut AE = \xC3\x84 / ue = \xC3\xBC \n";
if (!WriteConsoleA(consout, utf8, strlen(utf8), &nNumberOfCharsWritten, NULL)) {
    DWORD const err = GetLastError();
    cerr << "WriteConsole failed with " << err << "!\n";
    return 1;
}
This should output, if you set your console (cmd.exe) to use the Lucida Console font:
Umlaut AE = Ä / ue = ü
As for your question (taken from your comment) whether a Win32-API-converted string is the same as a raw UTF-8 (Linux) string: I will say yes. Given a Unicode character sequence, its UTF-16 (Windows wchar_t) representation converted to a UTF-8 (char) representation via the WideCharToMultiByte function will always yield the same byte sequence.
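A quick hedged sketch to check this on Windows (error handling omitted): convert with CP_UTF8 and dump the bytes.
#include <windows.h>
#include <cstdio>
int main() {
    const wchar_t* w = L"\u00E0\u00E7\u00E9"; // àçé
    char buf[16];
    int n = WideCharToMultiByte(CP_UTF8, 0, w, -1, buf, sizeof buf, NULL, NULL);
    for (int i = 0; i + 1 < n; ++i)           // n includes the trailing null
        std::printf("%02X ", (unsigned char)buf[i]);
    // expected: C3 A0 C3 A7 C3 A9, the same bytes a UTF-8 Linux cin would store
}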
When you convert the string to UTF-16 it consists of 16-bit wide characters; you can't compare it to ASCII values directly, because they aren't 16-bit values. You have to convert them to compare, or write a specialized comparison-to-ASCII function.
I doubt the UTF-8 cout on Linux would produce the same correct output unless the values were plain ASCII: UTF-8 is binary-compatible with ASCII only for code points below 128, and I assume UTF-16 relates to UTF-8 in a similar fashion.
The good news is there are many converters out there written to convert these strings to different character sets.