Hebrew chars in C++ (cout<<char<<char;) - c++

I'm trying to work with Hebrew characters in C++, using CLion on a Mac.
char notification[140]={"א"}; //this is ALEF the first letter of Hebrew ABC.
for(int i=0; i < strlen(notification); i++) {
cout << (int)notification[i] << endl;
} // Here I want to see the ASCII code for this letter.
The output of this loop is:
-41
-112
Though there is only 1 char entered.
cout << char(-41) << char(-112) << endl; // this one gives me the output of the letter ALEF
cout << char(-41) << char(-111) << endl; //gives the second letter of Hebrew ABC.
I can't understand how it works. Why are there 2 chars representing 1 Hebrew character?

What you see is the UTF-8 encoding of "א", but apparently your terminal does not support this charset or UTF-8.
(-41, -112) = (0xD7, 0x90), i.e. the two bytes printed as signed char values.
Look here for UTF-8 Hebrew characters
You need to find out how to configure the terminal to support the Hebrew charset and UTF-8.
maybe this can help

There are several sub-problems here.
a)
You need your data in some Unicode format instead of ASCII-based one-byte characters. You have that already; if you did not, no programming language feature in the world would do this for you automatically.
b)
As you have UTF8, depending on what you're doing, std::string etc. can handle the data well.
Eg.
input and output from/to files is ok
getting the used byte length is ok
(input/output to the terminal depends on the used terminal)
...
What is a problem is, e.g.:
counting how many characters (not bytes) there are
accessing single characters with varname[number]
things like Unicode normalization
... for such things, you'll need more code and/or external libraries like ICU.
c)
Your terminal needs to support UTF-8 if you want to print such strings directly to it (or read input from the user). This depends completely on the OS in use and its configuration; the C++ part can't help here. See e.g. OS X Terminal UTF-8 issues

Related

getline() doesn't read accented characters correctly

I'm trying to get accented characters from the user using getline(), but it does not print them correctly.
I tried including some libraries such as <locale>, but in vain.
Here's my code:
#include <iostream>
#include <cstdlib>
#include <string>
#include <locale>
using namespace std;
class Pers {
public:
string name;
int age;
string weapon;
};
int main()
{
setlocale(LC_ALL, "");
Pers pers;
cout << "Say the name of your character: ";
getline(cin, pers.name);
cout << pers.name;
}
When I type: Mark Coração, this is what I get:
How do I fix it?
Actually, the problem does not come from getline().
std::cout (respectively std::cin) does not support special characters. For this, you have to use std::wcout (respectively std::wcin), which uses wide characters (the size of standard characters limits you to what you can find in the ASCII table). You need to use bigger characters to store the special characters too, which is the case for wide characters. std::string handles standard characters; std::wstring handles wide characters.
A way to do this could be:
std::wstring a(L"Coração");
std::wcout << a << std::endl;
Output:
Coração
To make it work with getline():
std::wstring a;
getline(std::wcin, a);
std::wcout << a << std::endl;
I hope it can help.
There are two levels to this problem, which comes from using characters outside the ASCII charset. The two levels are:
how they are converted to narrow characters on input
how they will be displayed on output
The Windows console is a rather disturbing application in that respect: it is able to internally process UCS-2 characters, that is, any Unicode character in the Basic Multilingual Plane, or said differently, any character with a code point of at most 0xFFFF. On input into narrow characters, it tries to map any character not represented in the current charset to what it thinks is closest; on output, it just outputs the value of each byte in its current charset. So the most reliable way is to ensure that the current locale has a correct collating sequence and that the console has a correct code page (charset in Windows language). After seeing the displayed output, I assume that you are using code page 437, which contains semi-graphic characters but few non-ASCII ones.
As you only need Western European characters, I would advise you to use code page 1252. It is a Windows variant of the standard Latin1 or ISO-8859-1 charset (characters with a code point of at most 0xFF).
So if possible you should try to configure the system in a non-English Western European language (Portuguese would be fine, but French seems to be enough, so I would assume that Spanish would work too).
And you must configure the console with a correct code page: chcp 1252.
If that is not enough (I cannot currently test anything), you could try to use wide characters (wstring, wcin, wcout). But without changing the code page from 437, the console would not display accented characters.

Printing smiley face c++

I'm trying to print out the smiley face (from ascii) based on the amount of times the user asks for it, but on the console output screen, it only shows a square with another one inside of it. Where have I gone wrong?
#include <iostream>
using namespace std;
int main()
{
int smile;
cout << "How many smiley faces do you want to see? ";
cin >> smile;
for (int i = 0; i < smile; i++)
{
cout << static_cast<char>(1) << "\t";
}
cout << endl;
return 0;
}
ASCII does not have smileys (so in ASCII you'll have :-) and you expect your reader to understand that as a smiley). But Unicode has several ones, e.g. ☺ (white smiling face, U+263A); see http://unicodeemoticons.com/ or http://www.unicode.org/emoji/charts/emoji-list.html for a nice table of them.
In 2017, it is reasonable to use UTF8 everywhere (in terminals & outputs). UTF-8 is a very common encoding for Unicode, and many Unicode characters are encoded in several bytes in UTF-8.
So in a terminal using UTF8, with a font with many characters available, since ☺ is UTF8 encoded as "\342\230\272", use:
for (int i = 0; i < smile; i++)
{
cout << "\342\230\272" << "\t";
}
In 2017, most "consoles" are terminal emulators because real terminals (like the mythical VT100) are in museums today, and you can at least configure these terminal emulators to use UTF-8 encoding. On many operating systems (notably most Linux distributions and Mac OS X), they use UTF-8 by default.
If your C++11 compiler accepts UTF8 in strings (and a UTF8 source file), as most do today, you could even have "☺" in your source code. To type that you'll often use some copy and paste technique from an outside source. On my Linux system I often use some Character Map utility (e.g. run charmap in a terminal) to get them.
In ASCII, the character with code 1 is a control character, Start Of Heading. Perhaps you are confusing ASCII with CP437, which is no longer used (but in the 1980s encoded a smiley-thing at code 1).
You need to use Unicode and understand it. Today, in 2017, you cannot afford to use other encodings (they are historical legacy for museums) externally. Of course if you use weird characters, you should document that the user of your program should use some font having them (but most common fonts used in terminal emulators cover a very wide part of Unicode, so that is not a problem in practice). However, on my Linux computers, many fonts are lacking U+1F642 Slightly Smiling Face (e.g. "\360\237\231\202" in a C++ program) which appeared only in Unicode 7.0 in 2014.
Just do this in Visual Studio Code to print it:
cout<<"\2";

Is it possible to cout an EM DASH on Linux and Windows? [duplicate]

This question already has answers here:
Output Unicode to console Using C++, in Windows
I haven't been able to find a way to cout a '—' character, whether I put that in the cout statement like this: cout << "—"; or use char(151), the program prints out a fuzzy undefined character. Do you guys see anything wrong with my code? Is couting a EM DASH even possible?
Edit: I've also tried wcout << L"—"; and std::wcout << wchar_t(0x2014);. Those both print nothing in my terminal.
First of all, EM DASH is a Unicode character (just making sure you do know that).
Printing Unicode characters depends on what you're printing to.
If you're printing to a Unix terminal (or an emulator), the terminal emulator is using an encoding that supports this character, and that encoding matches the compiler's execution encoding, then you can do what you just did above in your source code: cout << "—";
If you're getting fuzzy undefined characters, it is possible that your terminal just doesn't support that character.
If you're in windows (where it is harder), you can do something like this (which is not portable):
#include <iostream>
#include <io.h>
#include <fcntl.h>
int main() {
_setmode(_fileno(stdout), _O_U16TEXT);
std::wcout << L"—";
}
There's no universal support for Unicode in C++ and in various terminals, so there won't be a portable solution.
The thing is that the Windows console uses codepages by default. It probably uses UTF-16 internally but will always convert to and from the current codepage when interacting with the outside. So simply printing a UTF-16 code unit like std::wcout << wchar_t(0x2014); won't work without any prior setup. You need to either switch the console to UTF-8 by running chcp 65001, or put stdout into UTF-16 mode with _setmode(_fileno(stdout), _O_U16TEXT); in code, before printing the character out with
std::wcout << L"—";
It will not always work because of the poorer Unicode support in the Windows console. In many cases the characters don't appear due to issues in the renderer or font, and are replaced with squares or ????. But in that case just copy the text out and paste it into any Unicode text box and it will be displayed properly.
If you're using Windows in English or some other Western European language that uses codepage 1252 then you can print the em dash, which is at position 151 (0x97) in that codepage (in plain ISO-8859-1 that position is a control code, not a printable character), simply by
cout << (char)151;
If it doesn't work then you're not on codepage 1252. You can change it to 1252 if possible, or look up the em dash in your codepage (if available).
On Linux things are much simpler because UTF-8 is used by default. So you can output the string as normal without resorting to std::wcout
std::cout << "—"; // need to make sure that std::string is in UTF-8
// or use std::cout << u8"—" to force the encoding (before C++20)
In fact you'll often get surprising results if you use wide strings on Linux. std::wcout << L"—" often won't work, possibly because of bugs in libc
That said, the Windows 10 console now supports UTF-8 well and even allows UTF-8 to be used as the locale, so if you don't need to support Windows 7 then there's a universal method to print any Unicode string (before C++20; since C++20 a u8 literal has type char8_t and cannot be streamed to std::cout directly):
std::cout << u8"—";

Opening Unicode text files in C++ and displaying their contents

Currently I am attempting to open a text file that was saved in Unicode format, copy its contents to a wstring, and then display it on the console. Because I am trying to understand more about working with strings and opening files, I'm experimenting with it in a simple program. Here is the source.
int main()
{
std::wfstream myfile("C:\\Users\\Jacob\\Documents\\openfiletest.txt");
if(!myfile.is_open())
{
std::cout << "error" << std::endl;
}
else
{
std::cout << "opened" << std::endl;
}
std::wstring mystring;
myfile >> mystring;
std::wcout << mystring << std::endl;
system("PAUSE");
}
When I try to display it on the console it displays ■W H Y when it should display WHY (really it's "WHY WONT YOU WORK", but I'll worry about why it's incomplete later I guess).
In all honesty, using Unicode is not very important to me because this isn't a program that I will be selling (it's more just for myself). I do want to get familiar with it though, because eventually I do plan on needing knowledge of using Unicode in C++. I am also using Boost.Filesystem for working with directories and multithreading while using C++/CLI for the GUI. My question(s): Should I really bother using Unicode if I don't need it at this point in time? If so, how do I fix this problem, and are there any cross-platform libraries for dealing with strings and files that use different Unicode encodings (Windows with UTF-16 and Linux with UTF-32)?
Also, any articles on Unicode in C++ or Unicode in general would be appreciated. Here is one that I found and it helped a little: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Thanks.
EDIT: Here is another article I just found that was useful: Reading UTF-8 Strings with C++
That's a byte order mark. If you find one at the beginning of the file, just strip it.
And the spaces in between letters are probably because the console isn't very wide char friendly.
It displays just one word because myfile is a stream and operator>> extracts just one whitespace-separated string from the stream. You might want to try the getline function.

Can't read unicode (japanese) from a file

Hi, I have a file containing Japanese text, saved as a Unicode file.
I need to read from the file and display the information on standard output.
I am using Visual Studio 2008.
int main()
{
wstring line;
wifstream myfile("D:\sample.txt"); //file containing japanese characters, saved as unicode file
//myfile.imbue(locale("Japanese_Japan"));
if(!myfile)
cout<<"While opening a file an error is encountered"<<endl;
else
cout << "File is successfully opened" << endl;
//wcout.imbue (locale("Japanese_Japan"));
while ( myfile.good() )
{
getline(myfile,line);
wcout << line << endl;
}
myfile.close();
system("PAUSE");
return 0;
}
This program generates some random output and I don't see any Japanese text on the screen.
Oh boy. Welcome to the Fun, Fun world of character encodings.
The first thing you need to know is that your console is not unicode on windows. The only way you'll ever see Japanese characters in a console application is if you set your non-unicode (ANSI) locale to Japanese. Which will also make backslashes look like yen symbols and break paths containing european accented characters for programs using the ANSI Windows API (which was supposed to have been deprecated when Windows XP came around, but people still use to this day...)
So first thing you'll want to do is build a GUI program instead. But I'll leave that as an exercise to the interested reader.
Second, there are a lot of ways to represent text. You first need to figure out the encoding in use. Is it UTF-8? UTF-16 (and if so, little or big endian)? Shift-JIS? EUC-JP? You can only use a wstream to read directly if the file is in little-endian UTF-16. And even then you need to futz with its internal buffer. Anything other than UTF-16 and you'll get unreadable junk. And this is all only the case on Windows as well! Other OSes may have a different wstream representation. It's best not to use wstreams at all really.
So, let's assume it's not UTF-16 (for full generality). In this case you must read it as a char stream - not using a wstream. You must then convert this character string into UTF-16 (assuming you're using windows! Other OSes tend to use UTF-8 char*s). On windows this can be done with MultiByteToWideChar. Make sure you pass in the right code page value, and CP_ACP or CP_OEMCP are almost always the wrong answer.
Now, you may be wondering how to determine which code page (ie, character encoding) is correct. The short answer is you don't. There is no prima facie way of looking at a text string and saying which encoding it is. Sure, there may be hints - eg, if you see a byte order mark, chances are it's whatever variant of unicode makes that mark. But in general, you have to be told by the user, or make an attempt to guess, relying on the user to correct you if you're wrong, or you have to select a fixed character set and don't attempt to support any others.
Someone here had the same problem with Russian characters (he's using basic_ifstream<wchar_t>, which should be the same as wifstream according to this page). In the comments of that question they also link to this, which should help you further.
If I understood everything correctly, it seems that wifstream reads the characters correctly but your program tries to convert them to whatever locale your program is running in.
Two errors:
The backslash in the path must be escaped: std::wifstream(L"D:\\sample.txt");
And do not mix cout and wcout.
Also check that your file is encoded in UTF-16, Little-Endian. If not so, you will be in trouble reading it.
wfstream uses wfilebuf for the actual reading and writing of the data. wfilebuf defaults to using a char buffer internally which means that the text in the file is assumed narrow, and converted to wide before you see it. Since the text was actually wide, you get a mess.
The solution is to replace the wfilebuf buffer with a wide one.
You probably also need to open the file as binary.
const size_t bufsize = 128;
wchar_t buffer[bufsize];
wifstream myfile("D:\\sample.txt", ios::binary);
myfile.rdbuf()->pubsetbuf(buffer, bufsize);
Make sure the buffer outlives the stream object!
See details here: http://msdn.microsoft.com/en-us/library/tzf8k3z8(v=VS.80).aspx