How to use exetended unix characters in c++ in Visual studio? - c++

We are using a korean font and freetype library and trying to display a korean character. But it displays some other characters indtead of hieroglyph
Code:
std::wstring text3 = L"놈";
Is there any tricks to type the korean characters?

For maximum portability, I'd suggest avoiding encoding Unicode characters directly in your source code and using \u escape sequences instead. The character 놈 is Unicode code point U+B188, so you could write this as:
std::wstring text3 = L"\uB188";

The question is what is the encoding of the source code.
It is likely UTF-8, which is one of the reasons not to use wstring. Use regular string. For more information on my way of handling characters, see http://utf8everywhere.org.

Related

Inputting a string containing greek characters in linux

I have a function which returns a string.
I have to define that string with greek characters in the function itself and should return that string.
I am working on Linux platform and my code is in C++.
My function is as follows:
string gen_string()
{
string str = "αγρω";
return str;
}
But I am not able to give the input.
When I try to copy paste the greek characters I want, it is appearing as some garbage characters.
Can some one please help me with this?
Thanks in advance.
EDIT:
Thanks for all your response.
Its not about using the wstring or string.
When I copy the string to the vim to give it as input, it is appearing as something like this.
▒~^▒~T▒~A▒~A201604¸▒~B▒žMDF_F▒~S123▒~T▒~B▒▒~B▒
I also tried by keeping the text in the file and opening the text file from vim.
But still it's the same.
string is only for ASCII characters, I believe.
You have international, likely Unicode characters. Consider using std::wstring for a multibyte "wide" string.
If you mean copy from some text to the terminal input then how to do this depends on the terminal. If it's a gnome terminal you need to specify UTF-8 in the locale settings though I'm not sure if that would get you the Greek alphabet.
locale command will list the current locale setting in locale.conf. You likely want to change the LANG setting. A way to do this system wide is
localectl set-locale LANG=en_country_code.UTF-8
Change country_code. It's US for the United States but I don't know what the Greek code is. You may need to be root. To change it just for yourself modify
~/.config/locale.conf
(or $XDG_CONFIG_HOME/locale.conf or $HOME/.config/locale.conf).
whichever gets you to the locale.conf file. On most systems all of them do.

How to read the … character and french accents from a text file

I am given a text file that contains a couple character per line. I have to read it, line by line, and apply a lexical analyzer on each character. Then, I write my analysis in another file.
With the following code, I have no problem reading french accents, but I realized that the character '…' (this is one character not 3 dots) is turned into a '&'.
Note: My lexical analyzer must use strings, that's why I converted back the wstring to a string.
wfstream SourceFile;
ofstream ResultFile (ResultFileName);
locale utf8_locale(std::locale(), new codecvt_utf8<wchar_t>);
SourceFile.imbue(utf8_locale);
SourceFile.open(SourceFileName);
while(getline(SourceFile, wLineBuffer))
{
string LineBuffer( wLineBuffer.begin(), wLineBuffer.end() );
...
Edit: Raymond Chen figured that the character is lost because of my conversion from wstring to string.
So the new question is now : How do I convert from a wstring to a string without transforming the characters ?
Edit: file sample
"stringééé"
"ccccccccccccccccccccccccccccccccccccccccccccccccccccccccc"
Identificateur1
Identificateur2
// Commentaire22
/**/
/*
Autre commentaire
…
*/
You need a proper Unicode support library. Forget using the broken Standard functions. They were not designed to support Unicode, don't support Unicode, and cannot be extended to support it properly. Look into using ICU or Boost.Locale or something like that.

What is the equivalent of mb_convert_encoding in GCC?

Currently I do the following in PHP to convert from UTF-8 to ASCII, but with the UTF-8 characters as HTML encoded. For instance, the German ß character would be converted to ß.
$sData = mb_convert_encoding($sData, 'HTML-ENTITIES','UTF-8');
In GCC C++ on Linux, what is the equivalent way to do this?
Actually, I can also take an alternative form like, instead of ß, the &#xNNN; format will do.

Use regular expression to match ANY Chinese character in utf-8 encoding

For example, I want to match a string consisting of m to n Chinese characters, then I can use:
[single Chinese character regular expression]{m,n}
Is there some regular expression of a single Chinese character, which could be any Chinese characters that exists?
The regex to match a Chinese (well, CJK) character is
\p{script=Han}
which can be appreviated to simply
\p{Han}
This assumes that your regex compiler meets requirement RL1.2 Properties from UTS#18 Unicode Regular Expressions. Perl and Java 7 both meet that spec, but many others do not.
In Java,
\p{InCJK_UNIFIED_IDEOGRAPHS}{1,3}
In C#
new Regex(#"\p{IsCJKUnifiedIdeographs}")
Here it is in the Microsoft docs
And here's more info from Wikipedia: CJK Unified Ideographs
The basic block named CJK Unified Ideographs (4E00–9FFF) contains 20,976 basic Chinese characters in the range U+4E00 through U+9FEF. The block not only includes characters used in the Chinese writing system but also kanji used in the Japanese writing system and hanja, whose use is diminishing in Korea. Many characters in this block are used in all three writing systems, while others are in only one or two of the three. Chinese characters are also used in Vietnam's Nôm script (now obsolete).
Is there some regular expression of a single Chinese character, which could be any Chinese characters that exists?
Recommendation
To match patterns with Chinese characters and other Unicode code points with a Flex-compatible lexical analyzer, you could use the RE/flex lexical analyzer for C++ that is backwards compatible with Flex. RE/flex supports Unicode and works with Bison to build lexers and parsers.
You can write Unicode patterns (and UTF-8 regular expressions) in RE/flex specifications such as:
%option flex unicode
%%
[肖晗] { printf ("xiaohan/2\n"); }
%%
Use global %option unicode to enable Unicode. You can also use a local modifier (?u:) to restrict Unicode to a single pattern (so everything else is still ASCII/8-bit as in Flex):
%option flex
%%
(?u:[肖晗]) { printf ("xiaohan/2\n"); }
(?u:\p{Han}) { printf ("Han character %s\n", yytext); }
. { printf ("8-bit character %d\n", yytext[0]); }
%%
Option flex enables Flex compatibility, so you can use yytext, yyleng, ECHO, and so on. Without the flex option RE/flex expects Lexer method calls: text() (or str() and wstr() for std::string and std::wstring), size() (or wsize() for wide char length), and echo(). RE/flex method calls are cleaner IMHO, and include wide char operations.
Background
In plain old Flex I ended up defining ugly UTF-8 patterns to capture ASCII letters and UTF-8 encoded letters for a compiler project that required support for Unicode identifiers id:
digit [0-9]
alpha ([a-zA-Z_\xA8\xAA\xAD\xAF\xB2\xB5\xB7\xB8\xB9\xBA\xBC\xBD\xBE]|[\xC0-\xFF][\x80-\xBF]*|\\u([0-9a-fA-F]{4}))
id ({alpha})({alpha}|{digit})*
The alpha pattern supports ASCII letters, underscore, and Unicode code points that are used in identifiers (\p{L} etc). The pattern permits more Unicode code points than absolutely necessary to keep the size of this pattern manageable, so it trades compactness for some lack of accuracy and to permit UTF-8 overlong characters in some cases that are not valid UTF-8. If you are thinking about this approach than be wary about the problems and safety concerns. Use a Unicode-capable scanner generator instead, such as RE/flex.
Safety
When using UTF-8 directly in Flex patterns, there are several concerns:
Encoding your own UTF-8 patterns in Flex for matching any Unicode character may be prone to errors. Patterns should be restricted to characters in the valid Unicode range only. Unicode code points cover the range U+0000 to U+D7FF and U+E000 to U+10FFFF. The range U+D800 to U+DFFF is reserved for UTF-16 surrogate pairs and are invalid code points. When using a tool to convert a Unicode range to UTF-8, make sure to exclude invalid code points.
Patterns should reject overlong and other invalid byte sequences. Invalid UTF-8 should not be silently accepted.
To catch lexical input errors in your lexer will require a special . (dot) that matches valid and invalid Unicode, including UTF-8 overruns and invalid byte sequences, in order to produce an error message that the input is rejected. If you use dot as a "catch-all-else" to produce an error message, but your dot does not match invalid Unicode, then you lexer will hang ("scanner is jammed") or your lexer will ECHO rubbish characters on the output by the Flex "default rule".
Your scanner should recognize a UTF BOM (Unicode Byte Order Mark) in the input to switch to UTF-8, UTF-16 (LE or BE), or UTF-32 (LE or BE).
As you point out, patterns such as [unicode characters] do not work at all with Flex because UTF-8 characters in a bracket list are multibyte characters and each single byte character can be matched but not the UTF-8 character.
See also invalid UTF encodings in the RE/flex user guide.
In Java 7 and up, the format should be: "\p{IsHan}"
Just solved a similar problem,
when you have too much stuff to match, is better use a negated-set and declare what you don't want to match like:
all but not numbers: ^[^0-9]*$
the second ^ will implement the negation
just like this:
package main
import (
"fmt"
"regexp"
)
func main() {
compile, err := regexp.Compile("\\p{Han}") // match one any Chinese character
if err != nil {
return
}
str := compile.FindString("hello 世界")
fmt.Println(str) // output: 世
}

Does Visual Studio 2010 Supports C++ Source Code in Unicode with Unicode Char in String Literal

I want to directly embed non-ASCII Unicode characters in string literals and use them in printf. This implies my source codes must be saved in utf-8 or utf-16. Visual Studio 2010 does support editing and saving C++ source files in either format. But when compiled & executed, it does not produce the correct unicode characters. Does the compiler support string literals with unicode characters embedded?
e.g.
wprintf(L" chinese characters:中文字\n"); the trailing chinese characters cannot be displayed
I don't have a Chinese version of Windows to test with, so this is complete speculation.
The console and file output functions are aware that files are not coded in UTF-16, so they attempt to convert the characters to a code page before output. Just as the default locale is "C" rather than anything based on your system settings, so too the default code page is probably an inappropriate one that does not include Chinese characters.
There is a function SetConsoleOutputCP to change the code page for the console. It is not clear if this function changes the code page used by the actual console window, or if it only affects conversions from Unicode within the program.
The easy way to test wide literals is to skip the formatting part of printf, and give your string straight to the OS: WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), L" chinese characters:中文字", ....
It's possible that #pragma setlocale may be what you need.