Use regular expression to match ANY Chinese character in utf-8 encoding - regex

For example, if I want to match a string consisting of m to n Chinese characters, I can use:
[single Chinese character regular expression]{m,n}
Is there a regular expression for a single Chinese character, which could be any Chinese character that exists?

The regex to match a Chinese (well, CJK) character is
\p{script=Han}
which can be abbreviated to simply
\p{Han}
This assumes that your regex compiler meets requirement RL1.2 Properties from UTS#18 Unicode Regular Expressions. Perl and Java 7 both meet that spec, but many others do not.
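For example, ICU4C's regex engine also supports Unicode script properties; here is a minimal C++ sketch (assuming ICU is installed and linked with -licuuc -licui18n) of the m-to-n matching from the question:
#include <unicode/regex.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    // One to three Han characters, i.e. [single Chinese character]{1,3}
    icu::UnicodeString pattern = icu::UnicodeString::fromUTF8("\\p{Han}{1,3}");
    icu::RegexMatcher matcher(pattern, 0, status);
    if (U_FAILURE(status)) return 1;

    icu::UnicodeString input = icu::UnicodeString::fromUTF8("hello 世界");
    matcher.reset(input);
    if (matcher.find()) {
        std::string match;
        matcher.group(status).toUTF8String(match);
        std::cout << match << "\n";  // prints 世界
    }
}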

In Java,
\p{InCJK_UNIFIED_IDEOGRAPHS}{1,3}

In C#
new Regex(@"\p{IsCJKUnifiedIdeographs}")
Here it is in the Microsoft docs
And here's more info from Wikipedia: CJK Unified Ideographs
The basic block named CJK Unified Ideographs (4E00–9FFF) contains 20,976 basic Chinese characters in the range U+4E00 through U+9FEF. The block not only includes characters used in the Chinese writing system but also kanji used in the Japanese writing system and hanja, whose use is diminishing in Korea. Many characters in this block are used in all three writing systems, while others are in only one or two of the three. Chinese characters are also used in Vietnam's Nôm script (now obsolete).

Is there a regular expression for a single Chinese character, which could be any Chinese character that exists?
Recommendation
To match patterns with Chinese characters and other Unicode code points with a Flex-compatible lexical analyzer, you could use the RE/flex lexical analyzer for C++ that is backwards compatible with Flex. RE/flex supports Unicode and works with Bison to build lexers and parsers.
You can write Unicode patterns (and UTF-8 regular expressions) in RE/flex specifications such as:
%option flex unicode
%%
[肖晗] { printf ("xiaohan/2\n"); }
%%
Use global %option unicode to enable Unicode. You can also use a local modifier (?u:) to restrict Unicode to a single pattern (so everything else is still ASCII/8-bit as in Flex):
%option flex
%%
(?u:[肖晗]) { printf ("xiaohan/2\n"); }
(?u:\p{Han}) { printf ("Han character %s\n", yytext); }
. { printf ("8-bit character %d\n", yytext[0]); }
%%
Option flex enables Flex compatibility, so you can use yytext, yyleng, ECHO, and so on. Without the flex option RE/flex expects Lexer method calls: text() (or str() and wstr() for std::string and std::wstring), size() (or wsize() for wide char length), and echo(). RE/flex method calls are cleaner IMHO, and include wide char operations.
Background
In plain old Flex I ended up defining ugly UTF-8 patterns to capture ASCII letters and UTF-8-encoded letters for a compiler project that required support for Unicode identifiers (id):
digit [0-9]
alpha ([a-zA-Z_\xA8\xAA\xAD\xAF\xB2\xB5\xB7\xB8\xB9\xBA\xBC\xBD\xBE]|[\xC0-\xFF][\x80-\xBF]*|\\u([0-9a-fA-F]{4}))
id ({alpha})({alpha}|{digit})*
The alpha pattern supports ASCII letters, underscore, and Unicode code points that are used in identifiers (\p{L} etc.). To keep the size of the pattern manageable, it permits more Unicode code points than strictly necessary, so it trades some accuracy for compactness and even accepts certain overlong sequences that are not valid UTF-8. If you are thinking about this approach, then be wary of the problems and safety concerns. Use a Unicode-capable scanner generator instead, such as RE/flex.
Safety
When using UTF-8 directly in Flex patterns, there are several concerns:
Encoding your own UTF-8 patterns in Flex for matching any Unicode character may be prone to errors. Patterns should be restricted to characters in the valid Unicode range only. Unicode code points cover the range U+0000 to U+D7FF and U+E000 to U+10FFFF. The range U+D800 to U+DFFF is reserved for UTF-16 surrogate pairs and are invalid code points. When using a tool to convert a Unicode range to UTF-8, make sure to exclude invalid code points.
Patterns should reject overlong and other invalid byte sequences. Invalid UTF-8 should not be silently accepted.
Catching lexical input errors in your lexer requires a special . (dot) that matches valid and invalid Unicode, including UTF-8 overruns and invalid byte sequences, in order to produce an error message saying the input is rejected. If you use dot as a "catch-all-else" to produce an error message, but your dot does not match invalid Unicode, then your lexer will hang ("scanner is jammed") or it will ECHO rubbish characters to the output via the Flex "default rule".
Your scanner should recognize a UTF BOM (Unicode Byte Order Mark) in the input to switch to UTF-8, UTF-16 (LE or BE), or UTF-32 (LE or BE).
As you point out, patterns such as [unicode characters] do not work at all with Flex, because UTF-8 characters in a bracket list are multibyte sequences and Flex matches each single byte, not the UTF-8 character as a whole.
See also invalid UTF encodings in the RE/flex user guide.
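To make the valid-range point above concrete, here is a small C++ check (an illustration only, not RE/flex code) for whether a code point is a valid Unicode scalar value:
#include <iostream>

// Valid Unicode scalar values: U+0000..U+D7FF and U+E000..U+10FFFF
// (everything except the UTF-16 surrogate range U+D800..U+DFFF).
bool isValidScalar(char32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

int main() {
    std::cout << isValidScalar(0x4E2D) << "\n";  // 1: U+4E2D (中) is valid
    std::cout << isValidScalar(0xD800) << "\n";  // 0: surrogate code point, exclude it from patterns
}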

In Java 7 and up, the format should be: "\p{IsHan}"

I just solved a similar problem.
When you have too much to match, it is better to use a negated set and declare what you don't want to match, for example:
anything but numbers: ^[^0-9]*$
The second ^ (the one inside the brackets) performs the negation.
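For example, a quick C++ sketch of that negated class with std::regex (fine here, since the pattern is ASCII-only):
#include <iostream>
#include <regex>

int main() {
    std::regex notDigits("^[^0-9]*$");  // [^0-9] = any character that is not a digit
    std::cout << std::regex_match("hello", notDigits) << "\n";  // 1: no digits present
    std::cout << std::regex_match("h3llo", notDigits) << "\n";  // 0: contains a digit
}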

In Go, you can do it like this:
package main

import (
    "fmt"
    "regexp"
)

func main() {
    compile, err := regexp.Compile("\\p{Han}") // matches any single Chinese (Han) character
    if err != nil {
        return
    }
    str := compile.FindString("hello 世界")
    fmt.Println(str) // output: 世
}

Related

find if string starts with \U in Python 3.3

I have a string and I want to find out if it starts with \U.
Here is an example
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
I was trying this:
myStr.startswith('\\U')
but I get False.
How can I detect \U in a string?
The larger picture:
I have a list of strings; most of them are normal English words, but there are a few that are similar to what I have shown in myStr. How can I distinguish them?
The original string does not have the character \U. It has the unicode escape sequence \U0001f64c, which is a single Unicode character.
Therefore, it does not make sense to try to detect \U in the string you have given.
Trying to detect the \U in that string is similar to trying to detect \x in the C string "\x90".
It makes no sense because the interpreter has read the sequence and converted it. Of course, if you want to detect the first Unicode character in that string, that works fine.
myStr.startswith('\U0001f64c')
Note that if you define the string with a real \U, like this, you can detect it just fine. Based on some experimentation, I believe Python 2.7.6 defaults to this behavior.
myStr = r'\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
myStr.startswith('\\U') # Returns True.
Update: The OP requested a way to convert from the Unicode string into the raw string above.
I will show the solution in two steps.
First observe that we can view the raw hex for each character like this.
>>> [hex(ord(x)) for x in myStr]
['0x1f64c', '0x1f60d', '0x1f4a6', '0x1f445', '0x1f4af']
Next, we format it by using a format string.
formatString = "".join(r'\U%08x' for x in myStr)
output = formatString % tuple(myChars)
output.startswith("\\U") # Returns True.
Note of course that since we are converting a Unicode string and formatting it this way deliberately, it is guaranteed to start with \U. However, I assume your actual application is not just to detect whether it starts with \U.
Update2: If the OP is trying to differentiate between "normal English" strings and "Unicode Strings", the above approach will not work, because all characters have a corresponding Unicode representation.
However, one heuristic you might use to check whether a string looks like plain ASCII is to check whether the value of every character falls within the normal ASCII range. Assuming that you consider the normal ASCII range to be between 32 and 127 (You can take a look here and decide what you want to include.), you can do something like the following.
def isNormal(myStr):
    myChars = [ord(x) for x in myStr]
    return all(x < 128 and x > 31 for x in myChars)
This can be done in one line, but I separated it to make it more readable.
Your string:
myStr = '\U0001f64c\U0001f60d\U0001f4a6\U0001f445\U0001f4af'
is not foreign-language text. It is 5 Unicode characters, which are (in order):
PERSON RAISING BOTH HANDS IN CELEBRATION
SMILING FACE WITH HEART-SHAPED EYES
SPLASHING SWEAT SYMBOL
TONGUE
HUNDRED POINTS SYMBOL
If you want to get strings that only contain 'normal' characters, you can use something like this:
if re.search(r'[^A-Za-z0-9\s]', myStr):
    # String contained 'weird' characters.
Note that this will also trip on characters like é, which will sometimes be used in English on words with a French origin.

How to read the … character and French accents from a text file

I am given a text file that contains a couple of characters per line. I have to read it, line by line, and apply a lexical analyzer to each character. Then, I write my analysis to another file.
With the following code, I have no problem reading French accents, but I realized that the character '…' (this is one character, not 3 dots) is turned into a '&'.
Note: My lexical analyzer must use strings, that's why I converted back the wstring to a string.
wfstream SourceFile;
ofstream ResultFile (ResultFileName);
locale utf8_locale(std::locale(), new codecvt_utf8<wchar_t>);
SourceFile.imbue(utf8_locale);
SourceFile.open(SourceFileName);
wstring wLineBuffer;
while (getline(SourceFile, wLineBuffer))
{
    string LineBuffer( wLineBuffer.begin(), wLineBuffer.end() );
    ...
Edit: Raymond Chen pointed out that the character is lost because of my conversion from wstring to string.
So the new question is now: How do I convert from a wstring to a string without transforming the characters?
Edit: file sample
"stringééé"
"ccccccccccccccccccccccccccccccccccccccccccccccccccccccccc"
Identificateur1
Identificateur2
// Commentaire22
/**/
/*
Autre commentaire
…
*/
You need a proper Unicode support library. Forget using the broken Standard functions. They were not designed to support Unicode, don't support Unicode, and cannot be extended to support it properly. Look into using ICU or Boost.Locale or something like that.
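For example, with Boost.Locale (one of the libraries suggested above), a minimal sketch of a lossless wstring-to-UTF-8-string conversion could look like this:
#include <boost/locale/encoding_utf.hpp>
#include <iostream>
#include <string>

int main() {
    std::wstring wide = L"caf\u00e9 \u2026";  // "café …" as a wide string
    // Re-encode between UTF forms instead of narrowing each wchar_t to one byte.
    std::string utf8 = boost::locale::conv::utf_to_utf<char>(wide);
    std::cout << utf8 << "\n";  // prints café … on a UTF-8 terminal
}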

CoreFoundation printing Unicode characters

I have the following code and it does seem to work, except for the fact that CFShow doesn't translate the Unicode UTF-8 encoding \u00e9 to é.
#include <CoreFoundation/CoreFoundation.h>
int main()
{
char *s = "This is a test of unicode support: fiancée\n";
CFTypeRef cfs = CFStringCreateWithCString(NULL, s, kCFStringEncodingUTF8);
CFShow(cfs);
}
Output is
This is a test of unicode support: fianc\u00e9e
where \u00e9 is the é that doesn't output properly.
How do I instruct CFShow that it is unicode? printf handles it fine when it is a c string.
CFShow() is only for debugging. It's deliberately converting non-ASCII to escape codes in order to avoid ambiguity. For example, "é" can be expressed in two ways: as U+00E9 LATIN SMALL LETTER E WITH ACUTE or as U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT. If CFShow() were to emit the UTF-8 sequence, your terminal would likely present it as "é" and you wouldn't be able to tell which variant was in the string. That would undermine the usefulness of CFShow() for debugging.
Why do you care what the output of CFShow() is, so long as you understand what the content of the string is?
It appears to me that CFShow knows that the string is Unicode, but doesn't know how to format Unicode for the console. I doubt that you can do anything but look for an alternative, perhaps NSLog.
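If the goal is just readable console output, one alternative sketch (not a way to configure CFShow itself) is to extract the UTF-8 bytes from the CFString and hand them to printf, which the question notes already handles UTF-8 fine:
#include <CoreFoundation/CoreFoundation.h>
#include <cstdio>
#include <vector>

int main() {
    CFStringRef cfs = CFStringCreateWithCString(NULL, "This is a test of unicode support: fiancée\n",
                                                kCFStringEncodingUTF8);
    // Worst-case UTF-8 buffer size for the string, plus room for the terminator.
    CFIndex size = CFStringGetMaximumSizeForEncoding(CFStringGetLength(cfs), kCFStringEncodingUTF8) + 1;
    std::vector<char> buffer(size);
    if (CFStringGetCString(cfs, buffer.data(), size, kCFStringEncodingUTF8)) {
        std::printf("%s", buffer.data());  // the terminal renders the raw UTF-8, é included
    }
    CFRelease(cfs);
}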

Range of UTF-8 Characters in C++11 Regex

This question is an extension of Do C++11 regular expressions work with UTF-8 strings?
#include <regex>
if (std::regex_match ("中", std::regex("中") )) // "\u4e2d" also works
std::cout << "matched\n";
The program is compiled on Mac Mountain Lion with clang++ with the following options:
clang++ -std=c++0x -stdlib=libc++
The code above works. This is a standard range regex "[一-龠々〆ヵヶ]" for matching any Japanese Kanji or Chinese character. It works in JavaScript and Ruby, but I can't seem to get ranges working in C++11, even when using a similar version [\u4E00-\u9fa0]. The code below does not match the string.
if (std::regex_match ("中", std::regex("[一-龠々〆ヵヶ]")))
std::cout << "range matched\n";
Changing locale hasn't helped either. Any ideas?
EDIT
So I have found that all ranges work if you add a + to the end, as in [一-龠々〆ヵヶ]+, but if you add {1}, as in [一-龠々〆ヵヶ]{1}, it does not work. Moreover, it seems to overreach its boundaries. It won't match Latin characters, but it will match は, which is \u306f, and ぁ, which is \u3041. They both lie below \u4E00.
nhahtdh also suggested regex_search, which also works without adding + but still runs into the same problem as above by pulling in values outside of its range. I played with the locales a bit as well. Mark Ransom suggests it treats the UTF-8 string as a dumb set of bytes; I think this is possibly what it is doing.
Further pushing the theory that the UTF-8 is getting jumbled somehow: [a-z]{1} and [a-z]+ both match a, but only [一-龠々〆ヵヶ]+ matches any of the characters, not [一-龠々〆ヵヶ]{1}.
Encoded in UTF-8, the string "[一-龠々〆ヵヶ]" is equal to this one: "[\xe4\xb8\x80-\xe9\xbe\xa0\xe3\x80\x85\xe3\x80\x86\xe3\x83\xb5\xe3\x83\xb6]". And this is not the droid character class you are looking for.
The character class you are looking for is the one that includes:
any character in the range U+4E00..U+9FA0; or
any of the characters 々, 〆, ヵ, ヶ.
The character class you specified is the one that includes:
any of the "characters" \xe4 or \xb8; or
any "character" in the range \x80..\xe9; or
any of the "characters" \xbe, \xa0, \xe3, \x80, \x85, \xe3 (again), \x80 (again), \x86, \xe3 (again), \x83, \xb5, \xe3 (again), \x83 (again), \xb6.
Messy isn't it? Do you see the problem?
This will not match "Latin" characters (by which I assume you mean things like a-z) because in UTF-8 those all use a single byte below 0x80, and none of those bytes is in that messy character class.
It will not match "中" either because "中" has three "characters", and your regex matches only one "character" out of that weird long list. Try assert(std::regex_match("中", std::regex("..."))) and you will see.
If you add a + it works because "中" has three of those "characters" in your weird long list, and now your regex matches one or more.
If you instead add {1} it does not match because we are back to matching three "characters" against one.
Incidentally "中" matches "中" because we are matching the three "characters" against the same three "characters" in the same order.
Note that the regex with + will actually match some undesired things because it does not care about order. Any character that can be made from that list of bytes in UTF-8 will match. It will match "\xe3\x81\x81" (ぁ U+3041) and it will even match invalid UTF-8 input like "\xe3\xe3\xe3\xe3".
The bigger problem is that you are using a regex library that does not even have level 1 support for Unicode, the bare minimum required. It munges bytes and there isn't much your precious tiny regex can do about it.
And the even bigger problem is that you are using a hardcoded set of characters to specify "any Japanese Kanji or Chinese character". Why not use the Unicode Script property for that?
R"(\p{Script=Han})"
Oh right, this won't work with C++11 regexes. For a moment there I almost forgot those are annoyingly worse than useless with Unicode.
So what should you do?
You could decode your input into a std::u32string and use char32_t all over for the matching. That would not give you this mess, but you would still be hardcoding ranges and exceptions when you mean "a set of characters that share a certain property".
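A minimal sketch of that char32_t approach (using std::wstring_convert, which is deprecated since C++17 but illustrates the idea):
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

// The hardcoded set from the question: U+4E00..U+9FA0 plus 々 〆 ヵ ヶ.
bool inKanjiClass(char32_t c) {
    return (c >= 0x4E00 && c <= 0x9FA0) ||
           c == U'々' || c == U'〆' || c == U'ヵ' || c == U'ヶ';
}

int main() {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string s = conv.from_bytes("中");  // decode UTF-8 bytes to code points
    std::cout << (s.size() == 1 && inKanjiClass(s[0])) << "\n";  // 1
}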
I recommend you forget about C++11 regexes and use some regular expression library that has the bare minimum level 1 Unicode support, like the one in ICU.

How to use extended Unicode characters in C++ in Visual Studio?

We are using a Korean font and the FreeType library and are trying to display a Korean character, but it displays some other characters instead of the intended glyph.
Code:
std::wstring text3 = L"놈";
Are there any tricks to typing Korean characters?
For maximum portability, I'd suggest avoiding encoding Unicode characters directly in your source code and using \u escape sequences instead. The character 놈 is Unicode code point U+B188, so you could write this as:
std::wstring text3 = L"\uB188";
The question is: what is the encoding of the source code?
It is likely UTF-8, which is one of the reasons not to use wstring. Use regular string. For more information on my way of handling characters, see http://utf8everywhere.org.
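A minimal sketch of that suggestion, storing the text as explicit UTF-8 bytes in a plain std::string so the result does not depend on the source or execution character set (and assuming a UTF-8 console):
#include <iostream>
#include <string>

int main() {
    // U+B188 (놈) encoded in UTF-8 is the byte sequence EB 86 88.
    std::string text3 = "\xEB\x86\x88";
    std::cout << text3 << "\n";  // prints 놈 on a UTF-8 terminal
}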