Farsi character utf8 in c++ - c++

i m trying to read and write Farsi characters in c++ and i want to show them in CMD
first thing i fix is Font i add Farsi Character to that and now i can write on the screen for example ب (uni : $0628) with this code:
#include <iostream>
#include <io.h>
#include <fcntl.h>
using namespace std;
int main() {
_setmode(_fileno(stdout), _O_U16TEXT);
wcout << L"\u0628 \n";
wcout << L"ب"<<endl;
system("pause");
}
but how i can keep this character ... for Latin characters we can use char or string but how about Farsi character utf8 ?!
and how i can get them ... for Latin characters we use cin>>or gets_s
should i use wchar_t? if yes how?
because with this code it show wrong character ...
wchar_t a='\u0628';
wcout <<a;
and i can't show this character بـ (uni $FE91) even though that exist in my installed font but ب (uni $0628) showed correctly
thanks in advance

The solution is the following line:
wchar_t a=L'\u0628';
The use of L tells the compiler that your type char is a wide char ("large" type, I guess? At least that's how I remember it) and this makes sure the value doesn't get truncated to 8 bits - thus this works as intended.
UPDATE
If you are building/running this as a console application in Windows you need to manage your code pages accordingly. The following code worked for me when using Cyrillic input (Windows code page 1251) when I set the proper code page before wcin and cout calls, basically at the very top of my main():
SetConsoleOutputCP(1251);
SetConsoleCP(1251);
For Farsi I'd expect you should use code page 1256.
Full test code for your reference:
#include <iostream>
#include <Windows.h>
using namespace std;
void main()
{
SetConsoleOutputCP(1256); // to manage console output
SetConsoleCP(1256); // to properly process console input
wchar_t b;
wcin >> b;
wcout << b << endl;
}

Related

Garbled character output when using cout API

I am trying to run this simple code in VS 2015
#include "stdafx.h"
# include <iostream>
int main()
{
char * szOldPath = "\"C:\icm\scripts\StartupSync\runall.bat\" nonprod";
std::cout << szOldPath << std::endl;
return 0;
}
However, the output of szOldPath is not proper and the console is printing--
unall.bat" nonprodupSync
I suspect this might be because of Unicode and I should be using wcout. So I disabled Unicode by going to Configuration Properties -> General --> Character Set and tried setting it to Not Set or Multi Byte. But still running into this issue.
I understand it is not good to disable UNICODE but I am trying to understand some legacy code written in our company and this experiment is a part of this exercise,is there any way I can get the cout command to print szOldPath successfully?
Your issue has nothing to do with Unicode.
\r is the escape sequence for a carriage return. So, you are printing out "C:\icm\scripts\StartupSync, and then \r tells the terminal to move the cursor back to the beginning of the current line, and then unall.bat" nonprod is printed, overwriting what was already there.
You need to escape all of the \ characters in your string literal, just like you had to escape the " characters.
Also, your variable needs to be declared as a pointer to const char when assigning a string literal to the pointer. This is enforced in C++11 and later:
#include "stdafx.h"
#include <iostream>
int main()
{
const char * szOldPath = "\"C:\\icm\\scripts\\StartupSync\\runall.bat\" nonprod";
std::cout << szOldPath << std::endl;
return 0;
}
Alternatively, in C++11 and later, you can use a raw string literal instead to avoid having to escape any characters with a leading \:
const char * szOldPath = R"("C:\icm\scripts\StartupSync\runall.bat" nonprod)";

wcout does not output as desired

I've been trying to write a C++ application for a project and I ran into this issue. Basically:
class OBSClass
{
public:
wstring ClassName;
uint8_t Credit;
uint8_t Level;
OBSClass() : ClassName(), Credit(), Level() {}
OBSClass(wstring name, uint8_t credit, uint8_t hyear)
: ClassName(name), Credit(credit), Level(hyear)
{}
};
In some other file:
vector<OBSClass> AllClasses;
...
AllClasses.push_back(OBSClass(L"Bilişim Sistemleri Mühendisliğine Giriş", 3, 1));
AllClasses.push_back(OBSClass(L"İş Sağlığı ve Güvenliği", 3, 1));
AllClasses.push_back(OBSClass(L"Türk Dili 1", 2, 1));
... (rest omitted, some of entries have non-ASCII characters like 'ş' and 'İ')
I have a function basically outputs everything in AllClasses, the problem is wcout does not output as desired.
void PrintClasses()
{
for (size_t i = 0; i < AllClasses.size(); i++)
{
wcout << "Class: " << AllClasses[i].ClassName << "\n";
}
}
Output is 'Class: Bili' and nothing else. Program does not even tries to output other entries and just hangs. I am on windows using G++ 6.3.0. And I am not using Windows' cmd, I am using bash from mingw, so encoding will not be problem (or isn't it?). Any advice?
Edit: Also source code encoding is not a problem, just checked it is UTF8, default of VSCode
Edit: Also just checked to find out if problem is with string literals.
wstring test;
wcin >> test;
wcout << test;
Entered some non-ASCII characters like 'ö' and 'ş', it works perfectly. What is the problem with wide string literals?
Edit: Here you go
#include <iostream>
#include <string>
#include <vector>
using namespace std;
vector<wstring> testvec;
int main()
{
testvec.push_back(L"Bilişim Sistemleri Mühendisliğine Giriş");
testvec.push_back(L"ıiÖöUuÜü");
testvec.push_back(L"☺☻♥♦♣♠•◘○");
for (size_t i = 0; i < testvec.size(); i++)
wcout << testvec[i] << "\n";
return 0;
}
Compile with G++:
g++ file.cc -O3
This code only outputs 'Bili'. It must be something with the g++ screwing up binary encoding (?), since entering values with wcin then outputting them with wcout does not generate any problem.
The following code works for me, using MinGW-w64 7.3.0 in both MSYS2 Bash, and Windows CMD; and with the source encoded as UTF-8:
#include <iostream>
#include <locale>
#include <string>
#include <codecvt>
int main()
{
std::ios_base::sync_with_stdio(false);
std::locale utf8( std::locale(), new std::codecvt_utf8_utf16<wchar_t> );
std::wcout.imbue(utf8);
std::wstring w(L"Bilişim Sistemleri Mühendisliğine Giriş");
std::wcout << w << '\n';
}
Explanation:
The Windows console doesn't support any sort of 16-bit output; it's only ANSI and a partial UTF-8 support. So you need to configure wcout to convert the output to UTF-8. This is the default for backwards compatibility purposes, though Windows 10 1803 does add an option to set that to UTF-8 (ref).
imbue with a codecvt_utf8_utf16 achieves this; however you also need to disable sync_with_stdio otherwise the stream doesn't even use the facet, it just defers to stdout which has a similar problem.
For writing to other files, I found the same technique works to write UTF-8. For writing a UTF-16 file you need to imbue the wofstream with a UTF-16 facet, see example here, and manually write a BOM.
Commentary: Many people just avoid trying to use wide iostreams completely, due to these issues.
You can write a UTF-8 file using a narrow stream; and have function calls in your code to convert wstring to UTF-8, if you are using wstring internally; you can of course use UTF-8 internally.
Of course you can also write a UTF-16 file using a narrow stream, just not with operator<< from a wstring.
If you have at least Windows 10 1903 (May 2019), and at least
Windows Terminal 0.3.2142 (Aug 2019). Then set Unicode:
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage]
"OEMCP"="65001"
and restart. After that you can use this:
#include <iostream>
int main() {
std::string a[] = {
"Bilişim Sistemleri Mühendisliğine Giriş",
"Türk Dili 1",
"İş Sağlığı ve Güvenliği",
"ıiÖöUuÜü",
"☺☻♥♦♣♠•◘○"
};
for (auto s: a) {
std::cout << s << std::endl;
}
}

Airplane symbol using Unicode C++

I'm trying to print the Airplane symbol using Unicode in my CodeBlocks. I found out that the code of Airplane is \u2708. So I tried the following code:
#include <iostream>
using namespace std;
int main() {
wchar_t a = '\u2708';
cout << a;
return 0;
}
It outputs 40072 when I replace wchar_t with char
char a = '\u2708';
I get this symbol: ł
Im really stuck, thanks for any help.
If you are in Linux and dont use codepage conversion from unicode to console try this:
std::locale::global(std::locale(""));
wchar_t plane = L'\u2708';
std::wcout << plane << std::endl;
In windows is a bit more complicated, you need a compatible unicode font on the default console and the correct codepage.

no output with wide streams

I have a problem with wide stream output. My primary concern is wofstream but wcout doesn't work properly either.
So it doesn't produce output besides Latin characters.
That is
#include <string>
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
wstring wstr = L"Андрей";
wofstream fout(L"C:\\Work\\report.htm");
wcout << wstr << L"Привет мир";
fout << wstr << L"Привет мир";
fout.close();
}
Produces no output, the file stays 0 byte long.
Mixing like wcout<<L"zuhщзг" prints just "zuh", ignores the rest.
I use MVS 2013 with Intel C++ Composer 14.0
EDIT:
Windows Unicode C++ Stream Output Failure describes similar problem. But I don't quite understand how the solution works.
MVS/Windows use UTF-16 for wide strings. and I would like they to be written in the file, as is, that is utf-16, without any unnecessary conversion

Matching russian vowels in C++

I wanted to write a function which returns true if a given character is a russian vowel. But the results I get are strange to me. This is what I've got so far:
#include <iostream>
using namespace std;
bool is_vowel_p(char working_char)
// returns true if the character is a russian vowel
{
string matcher = "аяё×эеуюыи";
if (find(matcher.begin(), matcher.end(), working_char) != matcher.end())
return true;
else
return false;
}
void main()
{
cout << is_vowel_p('е') << endl; // russian vowel
cout << is_vowel_p('Ж') << endl; // russian consonant
cout << is_vowel_p('D') << endl; // latin letter
}
The result is:
1
1
0
what is strange to me. I expected the following result:
1
0
0
It's seems that there is some kind of internal mechanism which I don't know yet. I'm at first interested in how to fix this function to work properly. And second, what is going on there, that I get this result.
string and char are only guaranteed to represent characters in the basic character set - which does not include the Cyrillic alphabet.
Using wstring and wchar_t, and adding L before the string and character literals to indicate that they use wide characters, should allow you to work with those letters.
Also, for portability you need to include <algorithm> for find, and give main a return type of int.
C++ source code is ASCII. You are entering unicode characters. The comparison is being done using 8 bit values. I bet one of the vowels fulfills the following:-
vowel & 255 == (code point for 'Ж') & 255
You need to use unicode functions to do this, not ASCII functions, i.e. use functions that require wchar_t values. Also, make sure your compiler can parse the non-ASCII vowel string. Using MS VC, the compiler requires:-
L"аяё×эеуюыи" or TEXT("аяё×эеуюыи")
the latter is a macro that adds the L when compiling with unicode support.
Convert the code to use wchar_t and it should work.
Very useful function in locale.h
setlocale(LC_ALL, "Russian");
Past this in the beginning of the program.
Example:
#include <stdio.h>
#include <locale.h>
void main()
{
setlocale(LC_ALL, "Russian");
printf("Здравствуй, мир!\n");//Hello, world!
}
Make sure your system default locale is Russian, and make sure your file is saved as codepage 1251 (Cyrillic/Windows). If it's saved as Unicode, this won't ever work.
The system default locale is the one used by non-Unicode-compliant programs. It's in Control Panel, under Regional settings.
Alternatively, rewritte to use wstring and wchar_t and L"" string/char literals.