Print unicode char

Print unicode char - c++

I tried a very simple code in C++:
#include <iostream>
#include <string>
int main()
{
std::wstring test = L"asdfa-";
test += u'ç';
std::wcout << test;
}
But the result was:
asdfa-?
It was not possible to print 'ç' with either cout or wcout. How can I print this string correctly?
OS: Linux.
PS: I use wstring instead of string because I sometimes need to calculate the length of the string, and that length must match what appears on the screen.
PS: I need to concatenate the Unicode character; it can't be part of the string constructor.

First, here's something that does work:
#include <iostream>
#include <string>
int main() {
std::string test = "asdfa-";
test += "ç";
std::cout << test;
}
I used just regular strings here and let C++ keep everything in UTF-8. I think you already know that this would work because you mentioned that you wanted to concatenate the ç rather than just leaving it in the string constructor.
Dealing with char, char16_t, char32_t, and wchar_t in C++ has never really been fun. You have to be careful with the L, u, and U prefixes.
However, where possible, if you deal with utf-8 strings, and avoid characters, you can generally get things to work much better. And since most consoles (with the possible exception of old Windows machines) understand utf-8 pretty well, this is the approach that often just works the best. So if you have wide characters, see if you can convert them to regular std::string objects and work in that domain.
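Since the question needs a length that matches what is shown on screen, here is a minimal sketch of counting code points in a UTF-8 std::string. The helper name count_code_points is made up for this illustration, and it assumes the input is valid UTF-8. It counts code points, not grapheme clusters or terminal columns, so combining marks and double-width characters would still need extra handling.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the answer above): counts UTF-8 code points
// by skipping continuation bytes, which always have the form 10xxxxxx.
// Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // an ASCII byte or a lead byte starts a new code point
        }
    }
    return count;
}
```

For the string from the question, written with explicit bytes as "asdfa-\xC3\xA7", size() reports 8 bytes while count_code_points reports 7.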

One general way of handling this would be:
Input (convert from multibyte to wide using current locale)
Your App: work with wide strings
Output or saving to a file (convert from wide to multibyte)
For wide-string operations such as counting characters or taking substrings, there is the wcsXXX family of functions (wcslen, wcsncpy, and so on).
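The input and output conversions in that scheme could look roughly like the sketch below, which uses the standard std::mbsrtowcs and std::wcsrtombs functions. The names to_wide and to_multibyte are hypothetical, the helpers return an empty string on a conversion error, and they stop at the first embedded null character. Both depend on the current C locale, so a real program would normally call setlocale(LC_ALL, "") first; in the default "C" locale only ASCII is guaranteed to convert.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper: multibyte -> wide, using the current C locale.
std::wstring to_wide(const std::string& in) {
    std::mbstate_t state = std::mbstate_t();
    const char* src = in.c_str();
    // First pass with a null destination only measures the result.
    const std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1)) return std::wstring();
    std::vector<wchar_t> buf(len + 1);
    state = std::mbstate_t();
    src = in.c_str();
    std::mbsrtowcs(buf.data(), &src, buf.size(), &state);
    return std::wstring(buf.data(), len);
}

// Hypothetical helper: wide -> multibyte, using the current C locale.
std::string to_multibyte(const std::wstring& in) {
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = in.c_str();
    const std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1)) return std::string();
    std::vector<char> buf(len + 1);
    state = std::mbstate_t();
    src = in.c_str();
    std::wcsrtombs(buf.data(), &src, buf.size(), &state);
    return std::string(buf.data(), len);
}
```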

If you are using libstdc++ on Linux: you forgot an essential call at the beginning of the program
std::locale::global(std::locale(""));
This is assuming you are on Linux and your locale supports UTF-8.
If you are using libc++: forget about using wstreams. This library does not support I/O of wide characters in a useful way (i.e. translation to UTF-8 like libstdc++ does).
Windows has a wholly separate set of quirks regarding Unicode. You are lucky if you don't have to deal with them.
demo with gcc/libstdc++ and a call to std::locale
demo with gcc/libstdc++ and no call to std::locale
Different versions of clang/libc++ behave differently with this example: some output ? instead of the non-ascii char, some output nothing; some crash on call to std::locale, some don't. None do the right thing, which is printing the ç, or maybe I just haven't found one that works. I don't recommend using libc++ if you need anything related to locale or wchar_t.
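As a hedged sketch of the libstdc++ advice above: std::locale("") throws std::runtime_error when the environment names a locale that is not installed, so it can be worth wrapping the call. The helper name use_environment_locale is an assumption of this example, not something from the answer.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: installs the user's environment locale as the global
// locale. Returns false (leaving the current global locale in place) if the
// environment names a locale that is not available.
bool use_environment_locale() {
    try {
        std::locale::global(std::locale(""));
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Calling it as the first statement of main(), before anything is written to std::wcout, matches the placement recommended above. Whether ç then prints correctly still depends on the environment locale being a UTF-8 one.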

I solved this problem using a conversion function:
#include <iostream>
#include <string>
#include <codecvt>
#include <locale>
std::string wstr2str(const std::wstring& wstr) {
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
return myconv.to_bytes(wstr);
}
int main()
{
std::wstring test = L"asdfa-";
test += L'ç';
std::string str = wstr2str(test);
std::cout << str;
}
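Note that std::wstring_convert and the <codecvt> header are deprecated as of C++17. If that matters, the same conversion can be written by hand. The sketch below is one possible version, not the code used in the answer: the name to_utf8 is made up, it takes a std::u32string so that each element is one code point on every platform, and it assumes the input contains only valid code points.

```cpp
#include <string>

// Hypothetical helper: encodes Unicode code points as UTF-8 by hand.
// Assumes every value is a valid code point (at most 0x10FFFF, no surrogates).
std::string to_utf8(const std::u32string& code_points) {
    std::string out;
    for (char32_t cp : code_points) {
        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this, to_utf8(U"asdfa-\u00E7") produces the bytes of "asdfa-" followed by 0xC3 0xA7, which std::cout can write as-is to a UTF-8 terminal.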

Related

How to implement console and file input/output functionality to work with UTF-8 encoding in MinGW? [duplicate]

This is the way I try to do it:
#include <stdio.h>
#include <windows.h>
using namespace std;
int main() {
SetConsoleOutputCP(CP_UTF8);
//german chars won't appear
char const* text = "aäbcdefghijklmnoöpqrsßtuüvwxyz";
int len = MultiByteToWideChar(CP_UTF8, 0, text, -1, 0, 0);
wchar_t *unicode_text = new wchar_t[len];
MultiByteToWideChar(CP_UTF8, 0, text, -1, unicode_text, len);
wprintf(L"%s", unicode_text);
}
And the effect is that only US-ASCII characters are displayed. No errors are shown. The source file is encoded in UTF-8.
So, what am I doing wrong here?
To WouterH:
int main() {
SetConsoleOutputCP(CP_UTF8);
const wchar_t *unicode_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
wprintf(L"%s", unicode_text);
}
this also doesn't work. Effect is just the same. My font is of course Lucida Console.
third take:
#include <stdio.h>
#define _WIN32_WINNT 0x05010300
#include <windows.h>
#define _O_U16TEXT 0x20000
#include <fcntl.h>
using namespace std;
int main() {
_setmode(_fileno(stdout), _O_U16TEXT);
const wchar_t *u_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
wprintf(L"%s", u_text);
}
ok, something begins to work, but the output is: ańbcdefghijklmno÷pqrs▀tuŘvwxyz.

By default the wide print functions on Windows do not handle characters outside the ascii range.
There are a few ways to get Unicode data to the Windows console.
1. Use the console API directly (WriteConsoleW). You'll have to ensure you're actually writing to a console and use other means when the output is to something else.
2. Set the mode of the standard output file descriptors to one of the 'Unicode' modes, _O_U16TEXT or _O_U8TEXT. This causes the wide character output functions to correctly output Unicode data to the Windows console. If they're used on file descriptors that don't represent a console then they cause the output stream of bytes to be UTF-16 and UTF-8 respectively. N.B. after setting these modes the non-wide character functions on the corresponding stream are unusable and result in a crash. You must use only the wide character functions.
3. Print UTF-8 text directly to the console by setting the console output codepage to CP_UTF8, if you use the right functions. Most of the higher level functions such as basic_ostream<char>::operator<<(char*) don't work this way, but you can either use lower level functions or implement your own ostream that works around the problem the standard functions have.
The problem with the third method is this:
putc('\302', stdout); putc('\260', stdout); // doesn't work with CP_UTF8
puts("\302\260"); // correctly writes UTF-8 data to Windows console with CP_UTF8
Unlike most operating systems, the console on Windows is not simply another file that accepts a stream of bytes. It's a special device created and owned by the program and accessed via its own unique WIN32 API. The issue is that when the console is written to, the API sees exactly the extent of the data passed in that use of its API, and the conversion from narrow characters to wide characters occurs without considering that the data may be incomplete. When a multibyte character is passed using more than one call to the console API, each separately passed piece is seen as an illegal encoding, and is treated as such.
It ought to be easy enough to work around this, but the CRT team at Microsoft views it as not their problem whereas whatever team works on the console probably doesn't care.
You might solve it by implementing your own streambuf subclass which handles doing the conversion to wchar_t correctly. I.e. accounting for the fact that bytes of multibyte characters may come separately, maintaining conversion state between writes (e.g., std::mbstate_t).
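The core of such a streambuf is deciding how many bytes of the pending buffer can be forwarded safely. The sketch below shows only that piece, in portable C++ with no Windows calls: it returns the length of the prefix that ends on a complete UTF-8 sequence, so a custom streambuf could write that prefix and hold the remaining bytes until the next write. The function name complete_utf8_prefix is hypothetical, and the code assumes otherwise well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns how many leading bytes of buf can be flushed
// without cutting a multibyte UTF-8 sequence in half.
std::size_t complete_utf8_prefix(const std::string& buf) {
    const std::size_t n = buf.size();
    // The last sequence starts at most 4 bytes before the end.
    for (std::size_t back = 1; back <= 4 && back <= n; ++back) {
        const unsigned char c = static_cast<unsigned char>(buf[n - back]);
        if ((c & 0xC0) == 0x80) continue;  // continuation byte, keep looking
        std::size_t need = 1;              // ASCII or unrecognised lead byte
        if ((c & 0xE0) == 0xC0) need = 2;
        else if ((c & 0xF0) == 0xE0) need = 3;
        else if ((c & 0xF8) == 0xF0) need = 4;
        // If the last sequence is still missing bytes, stop just before it.
        return need > back ? n - back : n;
    }
    return n;  // empty buffer, or a malformed tail that is passed through
}
```

In the putc example above, the buffer after the first call holds only '\302', so the function returns 0 and nothing would be sent to the console until '\260' arrives.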

Another trick, instead of SetConsoleOutputCP, would be using _setmode on stdout:
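// Includes needed for _setmode() and wprintf()
#include <io.h>
#include <fcntl.h>
#include <stdio.h>
int main() {
_setmode(_fileno(stdout), _O_U16TEXT);
// A wide string literal is const, so the pointer must be const as well
const wchar_t * unicode_text = L"aäbcdefghijklmnoöpqrsßtuüvwxyz";
wprintf(L"%s", unicode_text);
return 0;
}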
Don't forget to remove the call to SetConsoleOutputCP(CP_UTF8);

//Save As UTF8 without signature
#include<stdio.h>
#include<windows.h>
int main() {
SetConsoleOutputCP(65001);
const char unicode_text[]="aäbcdefghijklmnoöpqrsßtuüvwxyz";
printf("%s\n", unicode_text);
}
Result:
aäbcdefghijklmnoöpqrsßtuüvwxyz

I had similar problems, but none of the existing answers worked for me. Something else I observed is that, if I stick UTF-8 characters in a plain string literal, they would print properly, but if I tried to use a UTF-8 literal (u8"text"), the characters get butchered by the compiler (proved by printing out their numeric values one byte at a time; the raw literal had the correct UTF-8 bytes, as verified on a Linux machine, but the UTF-8 literal was garbage).
After some poking around, I found the solution: /utf-8. With that, everything Just Works; my sources are UTF-8, I can use explicit UTF-8 literals, and output works with no other changes needed.
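The byte-by-byte check mentioned above can be sketched as follows. The helper name hex_bytes is made up for this example; it simply renders each byte of a string as two hex digits so that the contents of a literal can be compared between compilers or machines.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: renders every byte of s as two lowercase hex digits,
// separated by spaces, so the exact encoding of a literal becomes visible.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

A correctly encoded "ä" in UTF-8 shows up as c3 a4; any other byte sequence means the compiler re-encoded the literal.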

The console can be set to display UTF-8 characters: as @vladasimovic's answer shows, SetConsoleOutputCP(CP_UTF8) can be used for that. Alternatively, you can prepare the console with the command chcp 65001, or with the call system("chcp 65001 > nul") in the main program. Don't forget to save the source code in UTF-8 as well.
To check the UTF-8 support, run
#include <stdio.h>
#include <windows.h>
BOOL CALLBACK showCPs(LPTSTR cp) {
puts(cp);
return true;
}
int main() {
EnumSystemCodePages(showCPs,CP_SUPPORTED);
}
65001 should appear in the list.
The Windows console uses OEM code pages by default, and most default raster fonts support only national characters. Windows XP and newer also support TrueType fonts, which should display the missing characters (@Devenec suggests Lucida Console in his answer).
Why printf fails
As @bames53 points out in his answer, the Windows console is not a stream device: you need to write all bytes of a multibyte character in one call. Sometimes printf gets this wrong and puts the bytes into the output buffer one by one. Try using sprintf and then puts on the result, or make sure fflush is called only once the whole output has accumulated in the buffer.
If everything else fails
Note the UTF-8 format: one character is encoded as 1 to 4 bytes (the original design allowed up to 6). Use this function to move to the next character in the string:
const char* ucshift(const char* str, int len=1) {
for(int i=0; i<len; ++i) {
if(*str==0) return str;
if(*str<0) {
unsigned char c = *str;
while((c<<=1)&128) ++str;
}
++str;
}
return str;
}
...and this function to transform the bytes into unicode number:
int ucchar(const char* str) {
if(!(*str&128)) return *str;
unsigned char c = *str, bytes = 0;
while((c<<=1)&128) ++bytes;
int result = 0;
for(int i=bytes; i>0; --i) result|= (*(str+i)&127)<<(6*(bytes-i));
int mask = 1;
for(int i=bytes; i<6; ++i) mask<<= 1, mask|= 1;
result|= (*str&mask)<<(6*bytes);
return result;
}
Then you can try to use some wild/ancient/non-standard winAPI function like MultiByteToWideChar (don't forget to call setlocale() before!)
or you can use your own mapping from Unicode table to your active working codepage. Example:
int main() {
system("chcp 65001 > nul");
char str[] = "příšerně"; // file saved in UTF-8
for(const char* p=str; *p!=0; p=ucshift(p)) {
int c = ucchar(p);
if(c<128) printf("%c\n",c);
else printf("%d\n",c);
}
}
This should print
p
345
237
353
e
r
n
283
If your code page doesn't support those Czech diacritics, you could map 345=>r, 237=>i, 353=>s, 283=>e. There are at least 5(!) different character sets just for Czech. Displaying readable characters across different Windows locales is a horror.
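For comparison, here is a hedged alternative to ucshift/ucchar that works on unsigned bytes (so it does not depend on char being signed) and returns all code points at once. The name decode_utf8 is hypothetical; the sketch assumes valid UTF-8 and stops early if the final sequence is truncated.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes a UTF-8 string into Unicode code points.
// Assumes valid input; a truncated final sequence is silently dropped.
std::vector<char32_t> decode_utf8(const std::string& utf8) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t extra = 0;   // number of continuation bytes to read
        char32_t cp = lead;      // ASCII by default
        if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
        else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
        if (i + extra >= utf8.size()) break;  // sequence runs past the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(utf8[i + k]);
            cp = (cp << 6) | (cont & 0x3F);   // append 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Decoding the UTF-8 bytes of "příšerně" yields 112, 345, 237, 353, 101, 114, 110, 283, the same numbers as in the expected output listed above.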

UTF-8 doesn't work for the Windows console. Period. I have tried all combinations with no success. The problems arise from the different ANSI/OEM character assignments, so answers that report no problem may come from programmers who use plain 7-bit ASCII or whose ANSI and OEM code pages are identical (Chinese, Japanese).
Either you stick to UTF-16 and the wide-char functions (but you are still restricted to the 256 characters of your OEM code page, except for Chinese/Japanese), or you use OEM code page strings in your source file.
Yes, it is a complete mess.
For multilingual programs I use string resources, and I wrote a LoadStringOem() function that translates the UTF-16 resource to an OEM string using WideCharToMultiByte() without an intermediate buffer. Since Windows automatically selects the right language from the resource, it will hopefully load a string in a language that is convertible to the target OEM code page.
As a consequence, you should not use 8-bit typographic characters (such as the ellipsis … and curly quotes “”) in the English-US resource, because Windows selects English-US as the fallback when no language match is found.
For example, if you have resources in German, Czech, Russian, and English-US and the user's system is Chinese, he or she will see English text with garbage in place of your carefully chosen typographic characters.
Now, on Windows 7 and 10, SetConsoleOutputCP(65001/*aka CP_UTF8*/) works as expected. You should keep your source file in UTF-8 without a BOM; otherwise the compiler will recode your string literals to ANSI. Moreover, the console font must contain the desired characters and must not be "Terminal". Unfortunately, there is no font that covers both umlauts and Chinese characters, even when you install both language packs, so you cannot truly display all character shapes at once.

I solved the problem in the following way:
Lucida Console doesn't seem to support umlauts, so changing the console font to Consolas, for example, works.
#include <stdio.h>
#include <Windows.h>
int main()
{
SetConsoleOutputCP(CP_UTF8);
// I'm using Visual Studio, so encoding the source file in UTF-8 won't work
const char* message = "a" "\xC3\xA4" "bcdefghijklmno" "\xC3\xB6" "pqrs" "\xC3\x9F" "tu" "\xC3\xBC" "vwxyz";
// Note the capital S in the first argument, when used with wprintf it
// specifies a single-byte or multi-byte character string (at least on
// Visual C, not sure about the C library MinGW is using)
wprintf(L"%S", message);
}
EDIT: fixed stupid typos and the decoding of the string literal, sorry about those.

What is the equivalent of `string` in C++

In Python, there is a type named string, what is the exact equivalent of python's string in C++?

The equivalent is std::string or std::wstring declared in the <string> header file.
Note, though, that Python probably behaves differently when it comes to automatic conversion to Unicode strings, as mentioned in @Vincent Savard's comment.
To deal with such conversions in C++ we use additional libraries such as libiconv, which is available on a broad range of platforms.
You should seriously note to do some better research before asking at Stack Overflow, or ask your question more clearly. std::string is ubiquitous.

You could either use std::string
or use a char array (or const char*) to represent a sequence of characters that functions as a primitive string.

Do you mean the std::string family?
#include <iostream>
#include <string>
int main() {
const std::string example = "test";
std::string exclaim = example + "!";
std::cout << exclaim << std::endl;
return 0;
}

Unicode range exceeds when try to print in C++

I am trying to print Unicode characters in C++. My characters are Old Turkic, and I have the font. When I use a letter's code, I get a different character. For example:
#include <iostream>
#include <string>
using namespace std;
int main()
{
string str = "\u10C00"; // My character's Unicode code.
cout << str << endl;
return 0;
}
This snippet outputs a different letter with a 0 right after it.
For example, it gives me this (let's assume that I want to print the letter 'Ö'):
A0
But when I copy and paste the actual letter into my source from the character-map application in Ubuntu, I get what I want. What is the problem here? I want to use the character-code form "\u10C00", but it doesn't work properly. I think the escape is too long, so the compiler uses only the first 6 characters and leaves the 0 at the end. How can I fix this?

After the \u escape there must be exactly 4 hexadecimal digits. If you need more, use \U, which takes exactly 8 hexadecimal digits.
Example:
"\u00D6" // 'Ö' letter
"\u10C00" // incorrect escape code!
"\U00010C00" // your character

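To see how the compiler parses the two escape forms from the question and the answer above, the following sketch uses UTF-32 literals, where every element is exactly one code point regardless of the execution character set. The function names are made up for this illustration.

```cpp
#include <string>

// Hypothetical demo functions: UTF-32 literals hold one element per code
// point, which makes the way the escapes are parsed directly visible.

// \u consumes exactly 4 hex digits: this is U+10C0 followed by the digit '0'.
std::u32string short_escape() { return U"\u10C00"; }

// \U consumes exactly 8 hex digits: this is the single code point U+10C00.
std::u32string long_escape() { return U"\U00010C00"; }
```

short_escape() has two elements, which is why the question's program printed an unrelated letter followed by a 0, while long_escape() has the single intended element.
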
std::string is not Unicode-aware: it stores bytes, so it can hold UTF-8 text, but it counts and indexes bytes rather than characters. You can use std::wstring instead,
but even std::wstring can have problems, because wchar_t does not have the same size on every platform (16 bits on Windows, 32 bits on Linux).
An alternative is an external string class such as Glib::ustring if you use gtkmm, or QString if you use Qt.
Almost every GUI toolkit, and many other libraries, provide their own string class to handle Unicode.

Wide to narrow characters

What is the cleanest way of converting a std::wstring into a std::string? I have used W2A et al macros in the past, but I have never liked them.

What you might be looking for is icu, an open-source, cross-platform library for dealing with Unicode and legacy encodings amongst many other things.

The most native way is std::ctype<wchar_t>::narrow(), but that does little more than std::copy as gishu suggested and you still need to manage your own buffers.
If you're not trying to perform any translation but just want a one-liner, you can do std::string my_string( my_wstring.begin(), my_wstring.end() ).
If you want actual encoding translation, you can use locales/codecvt or one of the libraries from another answer, but I'm guessing that's not what you're looking for.
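To make the limitation of the one-liner concrete, here is a small sketch that wraps it in a function (the name truncate_to_narrow is made up). Each wchar_t is converted to char individually, which on common platforms keeps only the low byte, so anything outside ASCII does not come out as valid UTF-8.

```cpp
#include <string>

// Hypothetical wrapper around the one-liner: converts element by element,
// with no encoding translation. Safe only for pure ASCII content.
std::string truncate_to_narrow(const std::wstring& wide) {
    return std::string(wide.begin(), wide.end());
}
```

For L"abc" the result is "abc", but for L"a\u00E7" the second byte is 0xE7 on its own, which a UTF-8 terminal will not display as ç.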

Since this is one of the first results for a search of "c++ narrow string," and it is from before C++11, here is the C++11 way of solving this problem:
#include <codecvt>
#include <locale>
#include <string>
std::string narrow( const std::wstring& str ){
std::wstring_convert<
std::codecvt_utf8_utf16< std::wstring::value_type >,
std::wstring::value_type
> utf16conv;
return utf16conv.to_bytes( str );
}
std::wstring_convert: http://en.cppreference.com/w/cpp/locale/wstring_convert
std::codecvt_utf8_utf16: http://en.cppreference.com/w/cpp/locale/codecvt_utf8_utf16

If the encoding in the wstring is UTF-16 and you want conversion to a UTF-8 encoded string, you can use UTF8 CPP library:
utf8::utf16to8(wstr.begin(), wstr.end(), back_inserter(str));

See if this helps. This one uses std::copy to achieve your goal.
http://www.codeguru.com/forum/archive/index.php/t-193852.html

I don't know if it's the "cleanest" but I've used copy() function without any problems so far.
#include <iostream>
#include <algorithm>
#include <string>
using namespace std;
string wstring2string(const wstring & wstr)
{
string str(wstr.length(),' ');
copy(wstr.begin(),wstr.end(),str.begin());
return str;
}
wstring string2wstring(const string & str)
{
wstring wstr(str.length(),L' ');
copy(str.begin(),str.end(),wstr.begin());
return wstr;
}
http://agraja.wordpress.com/2008/09/08/cpp-string-wstring-conversion/
