Use regular expression to match characters appearing in Traditional Chinese ONLY - regex

\p{Han} can be used to match all Chinese characters (in the Han script), which mixes both Simplified Chinese and Traditional Chinese.
Is there a regular expression to match only the characters unique to Traditional Chinese? In other words, match any Chinese character except the ones used in Simplified Chinese; something like (?!\p{Hans})\p{Hant}.
Furthermore, ideally, the regular expression would also exclude Japanese Kanji, Korean Hanja, and Vietnamese Chữ Nho and Chữ Nôm.

Because Traditional Chinese characters are not contiguous in the Unicode table, there is unfortunately no simple regex short of testing them one by one, unless things like \p{Hant} and \p{Hans} are supported by your regex engine.
Inspired by the answer^ pointed to by @jdaz's comment, I wrote a Python script using the hanzidentifier module to generate the regex that matches the characters unique to Traditional Chinese&:
from typing import List, Tuple

from hanzidentifier import identify, TRADITIONAL

def main():
    block = [
        *range(0x4E00, 0x9FFF + 1),     # CJK Unified Ideographs
        *range(0x3400, 0x4DBF + 1),     # CJK Unified Ideographs Extension A
        *range(0x20000, 0x2A6DF + 1),   # CJK Unified Ideographs Extension B
        *range(0x2A700, 0x2B73F + 1),   # CJK Unified Ideographs Extension C
        *range(0x2B740, 0x2B81F + 1),   # CJK Unified Ideographs Extension D
        *range(0x2B820, 0x2CEAF + 1),   # CJK Unified Ideographs Extension E
        *range(0x2CEB0, 0x2EBEF + 1),   # CJK Unified Ideographs Extension F
        *range(0x30000, 0x3134F + 1),   # CJK Unified Ideographs Extension G
        *range(0xF900, 0xFAFF + 1),     # CJK Compatibility Ideographs
        *range(0x2F800, 0x2FA1F + 1),   # CJK Compatibility Ideographs Supplement
    ]
    block.sort()
    result: List[Tuple[int, int]] = []
    for point in block:
        char = chr(point)
        identify_result = identify(char)
        if identify_result is TRADITIONAL:
            # is traditional only, save into the result list
            if len(result) > 0 and result[-1][1] + 1 == point:
                # the current char is right after the last char, just extend the range
                result[-1] = (result[-1][0], point)
            else:
                result.append((point, point))
    range_regexes: List[str] = []
    # now we have a list of ranges, convert them into a regex
    for start, end in result:
        if start == end:
            range_regexes.append(chr(start))
        elif start + 1 == end:
            range_regexes.append(chr(start))
            range_regexes.append(chr(end))
        else:
            range_regexes.append(f'{chr(start)}-{chr(end)}')
    # join them together and wrap in [] to form a regex character set
    regex_char_set = ''.join(range_regexes)
    print(f'[{regex_char_set}]')

if __name__ == '__main__':
    main()
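The range-merging step of the script can be tested in isolation. This stripped-down sketch (stdlib only, no hanzidentifier needed) shows the same run-length logic on plain integers:

```python
from typing import List, Tuple

def collapse_ranges(points: List[int]) -> List[Tuple[int, int]]:
    """Merge a sorted list of code points into inclusive (start, end) runs."""
    result: List[Tuple[int, int]] = []
    for point in points:
        if result and result[-1][1] + 1 == point:
            # the current point is right after the last one: extend the run
            result[-1] = (result[-1][0], point)
        else:
            result.append((point, point))
    return result

print(collapse_ranges([1, 2, 3, 7, 9, 10]))  # [(1, 3), (7, 7), (9, 10)]
```

Each run then becomes either a single character or a `start-end` range in the final character class.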
This generates the Regex which I've posted here: https://regex101.com/r/FkkHQ1/5 (seems like Stack Overflow doesn't like me posting the generated Regex)
Note that because hanzidentifier uses CC-CEDICT (and not the latest version of it, at that), some Traditional characters are certainly not covered, but it should be enough for the commonly used characters.
Japanese Kanji is a large set. Luckily, the Japanese Agency for Cultural Affairs publishes a list of commonly used kanji, so I created this text file for the program to read. After excluding the commonly used kanji, I got this regex: https://regex101.com/r/FkkHQ1/7
Unfortunately, I couldn't find a list of commonly used Korean Hanja; in any case, Hanja are rarely used nowadays. Vietnamese Chữ Nho and Chữ Nôm have almost disappeared as well.
Footnote:
^: the Regex in that answer doesn't match all Simplified characters. To get a Regex that matches all Simplified characters (including the ones in Traditional Chinese as well), change if identify_result is TRADITIONAL to if identify_result is SIMPLIFIED or identify_result is BOTH, which gives us the Regex: https://regex101.com/r/FkkHQ1/6
&: this script doesn't filter Japanese Kanji, Korean Hanja, Vietnamese Chữ Nho or Chữ Nôm. You have to modify it to exclude them.

Related

Improve performance of large document text tokenization through Python + RegEx

I'm currently trying to process a large amount of very big (>10k words) text files. In my data pipeline, I identified the gensim tokenize function as my bottleneck, the relevant part is provided in my MWE below:
import re
import urllib.request
url='https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/genesis/english-web.txt'
doc=urllib.request.urlopen(url).read().decode('utf-8')
PAT_ALPHABETIC = re.compile('(((?![\d])\w)+)', re.UNICODE)
def tokenize(text):
    text.strip()
    for match in PAT_ALPHABETIC.finditer(text):
        yield match.group()

def preprocessing(doc):
    tokens = [token for token in tokenize(doc)]
    return tokens

foo = preprocessing(doc)
Calling the preprocessing function for the given example takes roughly 66ms and I would like to improve this number. Is there anything I can still optimize in my code? Or is my hardware (Mid 2010s Consumer Notebook) the issue? I would be interested in the runtimes from people with some more recent hardware as well.
Thank you in advance
You may use
PAT_ALPHABETIC = re.compile(r'[^\W\d]+')
def tokenize(text):
    for match in PAT_ALPHABETIC.finditer(text):
        yield match.group()
Note:
\w matches letters, digits, underscores, and some other connector punctuation and diacritics by default in Python 3.x; you do not need the re.UNICODE or re.U option
To "exclude" (or "subtract") digit matching from \w, the ((?!\d)\w)+ pattern is overkill: all you need to do is "convert" the \w into an equivalent negated character class, [^\W], and add \d there: [^\W\d]+.
Note the extraneous text.strip(): Python strings are immutable, and if you do not assign the return value to a variable, the call has no effect. Since whitespace in the input string does not interfere with the regex [^\W\d]+, you may simply remove text.strip() from your code.
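A quick check that [^\W\d]+ keeps letters and underscores but drops digits (the sample string below is made up for illustration):

```python
import re

# [^\W\d]+ = "word characters that are not digits"; Unicode by default in Python 3
PAT_ALPHABETIC = re.compile(r'[^\W\d]+')

# digits split tokens, accented letters and underscores are kept
print(PAT_ALPHABETIC.findall('abc 123 déjà_vu x2y'))  # ['abc', 'déjà_vu', 'x', 'y']
```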

Split using regex in python

I want to split the file name from the given path using the regex function re.split. Please find the details below:
SVC_DC = 'JHN097567898_01102019_050514_svc_dc.tar'
My solution:
import os
import regex as re
ans = re.split(os.sep, SVC_DC)
Error: re.error: bad escape (end of pattern) at position 0
Thanks in advance
If you want a filename, regexes are not your answer.
Python has the pathlib module dedicated to handling file paths; its objects, besides having methods to get the isolated filename while handling all possible corner cases, also have methods to open and list files and do everything one normally does with a file.
To get the base filename from a path, just use its automatic properties:
In [1]: import pathlib
In [2]: name = pathlib.Path("/home/user/JHN097567898_01102019_050514_svc_dc.tar")
In [3]: name.name
Out[3]: 'JHN097567898_01102019_050514_svc_dc.tar'
In [4]: name.parent
Out[4]: PosixPath('/home/user')
Otherwise, even if you did not use pathlib, os.path.sep is a single character, so there would be no advantage in using re.split at all: a normal string.split would do. Actually, there is os.path.split as well, which, predating pathlib, has always done the same thing:
In [6]: name = "/home/user/JHN097567898_01102019_050514_svc_dc.tar"
In [7]: import os
In [8]: os.path.split(name)[-1]
Out[8]: 'JHN097567898_01102019_050514_svc_dc.tar'
And last (and in this case, actually least), the reason for the error is that you are on Windows, and your os.path.sep character is "\". This character alone is not a full regular expression: the regex engine expects a character indicating a special sequence to come after the "\". To use it without error, you'd need to do:
re.split(re.escape(os.path.sep), "myfilepath")
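A runnable sketch of the re.escape approach; the Windows-style path is hard-coded here (rather than taken from os.path.sep) so it behaves the same on any OS:

```python
import re

# a single backslash is an incomplete regex; re.escape makes it a valid pattern
sep = "\\"
path = r"C:\Users\me\JHN097567898_01102019_050514_svc_dc.tar"
parts = re.split(re.escape(sep), path)
print(parts[-1])  # JHN097567898_01102019_050514_svc_dc.tar
```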
The reason for your failure is a detail concerning regular expressions,
namely the quoting issue.
E.g. under Windows, os.sep == '\\', i.e. a single backslash.
But the backslash has special meaning in regex (it escapes special characters),
so in order to use it literally, you have to write it twice.
Try the following code:
import re
import os
SVC_DC = 'JHN097567898_01102019_050514_svc_dc.tar'
print(re.split(os.sep * 2, SVC_DC))
The result is:
['JHN097567898_01102019_050514_svc_dc.tar']
As the source string does not contain any backslashes, the result
is a list containing only one item (the whole source string).
Edit
To make the regex working under both Windows and Unix, you can try:
print(re.split('\\' + os.sep, SVC_DC))
Note that this regex contains:
a hard-coded backslash as the escape character,
the path separator used in the current operating system.
Note that the forward slash (in Unix case) does not require quotation,
but using quotation here is still acceptable (not needed, but working).
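Since os.sep is a single literal character, plain str.split sidesteps the escaping issue entirely; a portable sketch (the filename is taken from the question):

```python
import os

# build a path with the current OS's separator, then split on that same character
path = os.path.join("home", "user", "JHN097567898_01102019_050514_svc_dc.tar")
print(path.split(os.sep)[-1])  # JHN097567898_01102019_050514_svc_dc.tar
```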

Replace all emojis from a given unicode string

I have a list of unicode symbols from the emoji package. My end goal is to create a function that takes a Unicode string as input, e.g. some👩😌thing, and removes all emojis from it, yielding "something". Below is a demonstration of what I want to achieve:
from emoji import UNICODE_EMOJI
text = 'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
output = ... = 'something'
I have been trying to do the above, and in that process, I came across a strange behavior which I demonstrate below, as you can see. I believe if the code below is fixed, then I will be able to achieve my end goal.
import regex as re
print u'\U0001F469' # 👩
print u'\U0001F60C' # 😌
print u'\U0001F469\U0001F60C' # 👩😌
text = u'some\U0001F469\U0001F60Cthing'
print text # some👩😌thing
# Removing "👩😌" works
print re.sub(ur'[\U0001f469\U0001F60C]+', u'', text) # something
# Removing only "👩" doesn't work
print re.sub(ur'[\U0001f469]+', u'', text) # some�thing
In most builds of Python 2.7, Unicode codepoints above 0x10000 are encoded as a surrogate pair, meaning Python actually sees them as two characters. You can prove this to yourself with len(u'\U0001F469').
The best way to solve this is to move to a version of Python that properly treats those codepoints as a single entity rather than a surrogate pair. You can compile Python 2.7 for this, and the recent versions of Python 3 will do it automatically.
To create a regular expression to use for the replace, simply join all the characters together with |. Since the list of characters already is encoded with surrogate pairs it will create the proper string.
subs = u'|'.join(exclude_list)
print re.sub(subs, u'', text)
The old 2.7 regex engine gets confused because:
Python 2.7 narrow builds use 16-bit internal Unicode storage, in which codepoints above U+FFFF are automatically represented by surrogate pairs.
Before the regex "sees" your Python string, Python has already helpfully split your large Unicode codepoints into two separate characters (each on its own a valid, but incomplete, unit).
That means that [\U0001f469]+ denotes a character class of 2 characters, one of which is in your string while the other is not. That leads to your badly formed output.
This fixes it:
print re.sub(ur'(\U0001f469|\U0001F60C)+', u'', text) # something
# Removing only "👩" with a character class didn't work:
print re.sub(ur'[\U0001f469]+', u'', text) # some�thing
# .. but with a group instead of a class, it does:
print re.sub(ur'(\U0001f469)+', u'', text) # some😌thing
because now the regex engine sees the exact same sequence of characters – surrogate pairs or otherwise – that you are looking for.
If you want to remove all emoji from the exclude_list, you can explicitly loop over its contents and replace one by one:
exclude_list = UNICODE_EMOJI.keys()
for bad in exclude_list:  # or simply "for bad in UNICODE_EMOJI" if you gotta catch them all
    if bad in text:
        print 'Removing ' + bad
        text = text.replace(bad, '')
Removing 👩
Removing 😌
something
(This also shows the intermediate results as proof it works; you only need the replace line in the loop.)
To remove all emojis from the input string using the current approach, use
import re
from emoji import UNICODE_EMOJI
text = u'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
rx = ur"(?:{})+".format("|".join(map(re.escape,exclude_list)))
print re.sub(rx, u'', text)
# => u'something'
If you do not re.escape the emoji chars, you will get a nothing to repeat error due to the literal chars interfering with the alternation operators inside the group, so map(re.escape, exclude_list) is required.
Tested in Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2.
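For reference, the same idea works unchanged on Python 3, where codepoints above U+FFFF are single characters and the surrogate-pair problem disappears. This sketch uses a hand-picked stand-in list instead of UNICODE_EMOJI.keys(), whose contents vary across emoji-package versions:

```python
import re

# stand-in for UNICODE_EMOJI.keys(): just the two emoji from the question
exclude_list = ['\U0001F469', '\U0001F60C']

# escape each emoji, join with "|", and match one or more in a row
rx = re.compile('(?:{})+'.format('|'.join(map(re.escape, exclude_list))))
print(rx.sub('', 'some\U0001F469\U0001F60Cthing'))  # something
```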

How can I specify Cyrillic character ranges in a Python 3.2 regex?

Once upon a time, I found this question interesting.
Today I decided to play around with the text of that book.
I want to use the regular expression in this script. When I use the script on Cyrillic text, it wipes out all of the Cyrillic characters, leaving only punctuation and whitespace.
#!/usr/bin/env python3.2
# coding=UTF-8
import sys, re

for file in sys.argv[1:]:
    f = open(file)
    fs = f.read()
    regexnl = re.compile('[^\s\w.,?!:;-]')
    rstuff = regexnl.sub('', fs)
    f.close()
    print(rstuff)
Something very similar has already been done in this answer.
Basically, I just want to be able to specify a set of characters that are not alphabetic, alphanumeric, or punctuation or whitespace.
This doesn't exactly answer your question, but the regex module has much much better unicode support than the built-in re module. e.g. regex supports the \p{Cyrillic} property and its negation \P{Cyrillic} (as well as a huge number of other unicode properties). Also, it handles unicode case-insensitivity correctly.
You can specify the unicode range pretty easily: \u0400-\u0500. See also here.
Here's an example with some text from the Russian wikipedia, and also a sentence from the English wikipedia containing a single word in cyrillic.
#coding=utf-8
import re
ru = u"Владивосток находится на одной широте с Сочи, однако имеет среднегодовую температуру почти на 10 градусов ниже."
en = u"Vladivostok (Russian: Владивосток; IPA: [vlədʲɪvɐˈstok] ( listen); Chinese: 海參崴; pinyin: Hǎishēnwǎi) is a city and the administrative center of Primorsky Krai, Russia"
cyril1 = re.findall(u"[\u0400-\u0500]+", en)
cyril2 = re.findall(u"[\u0400-\u0500]+", ru)
for x in cyril1:
    print x
for x in cyril2:
    print x
output:
Владивосток
------
Владивосток
находится
на
одной
широте
с
Сочи
однако
имеет
среднегодовую
температуру
почти
на
градусов
ниже
Addition:
Two other ways that should also work, and in a bit less hackish fashion than specifying a unicode range:
re.findall("(?u)\w+", text) should match Cyrillic as well as Latin word characters.
re.findall("\w+", text, re.UNICODE) is equivalent
So, more specifically for your problem:
re.compile(r'[^\s\w.,?!:;-]', re.UNICODE) should do the trick.
See here (point 7)
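A minimal check (Python 3 syntax, sample string made up for illustration) that with the flag outside the string literal, Cyrillic letters count as \w and survive the substitution:

```python
import re

# the flag is a second argument to compile, not part of the pattern string
regexnl = re.compile(r'[^\s\w.,?!:;-]', re.UNICODE)

# only the parentheses are outside the allowed set, so only they are removed
print(regexnl.sub('', 'Привет, мир! (тест)'))  # Привет, мир! тест
```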
For practical reasons I suggest using the exact Modern Russian subset of glyphs, instead of general Cyrillic. This is because Russian websites never use the full Cyrillic subset, which includes Belarusian, Ukrainian, Slavonic and Macedonian glyphs. For historical reasons I am keeping "u\0463".
//Basic Cyr Unicode range for use on Russian websites.
0401,0406,0410,0411,0412,0413,0414,0415,0416,0417,0418,0419,041A,041B,041C,041D,041E,041F,0420,0421,0422,0423,0424,0425,0426,0427,0428,0429,042A,042B,042C,042D,042E,042F,0430,0431,0432,0433,0434,0435,0436,0437,0438,0439,043A,043B,043C,043D,043E,043F,0440,0441,0442,0443,0444,0445,0446,0447,0448,0449,044A,044B,044C,044D,044E,044F,0451,0462,0463
Using this subset on a multilingual website will save you 60% of bandwidth, in comparison to using the original full range, and will increase page loading speed accordingly.

How to search all CJK chars in vim?

I can search a CJK char (such as 小) by using a unicode code point:
/\%u5c0f
/[\u5c0f]
I cannot search all of CJK chars by using [\u4E00-\u9FFF], because vim manual says:
:help /[]
NOTE: The other backslash codes mentioned above do not work inside []!
Is these a way to do the job?
It seems that Vim ranges are somehow limited to the same high byte, because /[\u4E00-\u4eFF] works fine. If you don't mind the mess, try:
/[\u4e00-\u4eff\u4f00-\u4fff\u5000-\u50ff\u5100-\u51ff\u5200-\u52ff\u5300-\u53ff\u5400-\u54ff\u5500-\u55ff\u5600-\u56ff\u5700-\u57ff\u5800-\u58ff\u5900-\u59ff\u5a00-\u5aff\u5b00-\u5bff\u5c00-\u5cff\u5d00-\u5dff\u5e00-\u5eff\u5f00-\u5fff\u6000-\u60ff\u6100-\u61ff\u6200-\u62ff\u6300-\u63ff\u6400-\u64ff\u6500-\u65ff\u6600-\u66ff\u6700-\u67ff\u6800-\u68ff\u6900-\u69ff\u6a00-\u6aff\u6b00-\u6bff\u6c00-\u6cff\u6d00-\u6dff\u6e00-\u6eff\u6f00-\u6fff\u7000-\u70ff\u7100-\u71ff\u7200-\u72ff\u7300-\u73ff\u7400-\u74ff\u7500-\u75ff\u7600-\u76ff\u7700-\u77ff\u7800-\u78ff\u7900-\u79ff\u7a00-\u7aff\u7b00-\u7bff\u7c00-\u7cff\u7d00-\u7dff\u7e00-\u7eff\u7f00-\u7fff\u8000-\u80ff\u8100-\u81ff\u8200-\u82ff\u8300-\u83ff\u8400-\u84ff\u8500-\u85ff\u8600-\u86ff\u8700-\u87ff\u8800-\u88ff\u8900-\u89ff\u8a00-\u8aff\u8b00-\u8bff\u8c00-\u8cff\u8d00-\u8dff\u8e00-\u8eff\u8f00-\u8fff\u9000-\u90ff\u9100-\u91ff\u9200-\u92ff\u9300-\u93ff\u9400-\u94ff\u9500-\u95ff\u9600-\u96ff\u9700-\u97ff\u9800-\u98ff\u9900-\u99ff\u9a00-\u9aff\u9b00-\u9bff\u9c00-\u9cff\u9d00-\u9dff\u9e00-\u9eff\u9f00-\u9fff]
I played around with this quite a bit and in vim the following seems to find all the Kanji characters in my Kanji/Pinyin/English text:
[^!-~0-9 aāáǎăàeēéěèiīíǐĭìoōóǒŏòuūúǔùǖǘǚǜ]
Vim cannot actually do this by itself, since you aren’t given access to Unicode properties like \p{Han}.
As of Unicode v6.0, the range of codepoints for characters in the Han script is:
2E80-2E99 2E9B-2EF3 2F00-2FD5 3005-3005 3007-3007 3021-3029 3038-303B 3400-4DB5 4E00-9FCB F900-FA2D FA30-FA6D FA70-FAD9 20000-2A6D6 2A700-2B734 2B740-2B81D 2F800-2FA1D
Whereas with Unicode v6.1, the range of Han codepoints has changed to:
2E80-2E99 2E9B-2EF3 2F00-2FD5 3005-3005 3007-3007 3021-3029 3038-303B 3400-4DB5 4E00-9FCC F900-FA6D FA70-FAD9 20000-2A6D6 2A700-2B734 2B740-2B81D 2F800-2FA1D
I also seem to recall that Vim has difficulties expressing astral code points, which are needed for this to work correctly. For example, using the flexible \x{HHHHHH} notation from Java 7 or Perl, you would have:
[\x{2E80}-\x{2E99}\x{2E9B}-\x{2EF3}\x{2F00}-\x{2FD5}\x{3005}-\x{3005}\x{3007}-\x{3007}\x{3021}-\x{3029}\x{3038}-\x{303B}\x{3400}-\x{4DB5}\x{4E00}-\x{9FCC}\x{F900}-\x{FA6D}\x{FA70}-\x{FAD9}\x{20000}-\x{2A6D6}\x{2A700}-\x{2B734}\x{2B740}-\x{2B81D}\x{2F800}-\x{2FA1D}]
Notice that the last part of the range is \x{2F800}-\x{2FA1D}, which is beyond the BMP. But what you really need is \p{Han} (meaning, \p{Script=Han}). This again shows that regex dialects that don’t support at least Level 1 of UTS#18: Basic Unicode Support are inadequate for working with Unicode. Vim’s regexes are inadequate for basic Unicode work.
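Outside Vim, an engine without \p{Han} can still approximate it by pasting the ranges above into an ordinary character class. A Python 3 sketch using the Unicode 6.1 ranges quoted above (the test string is made up for illustration):

```python
import re

# Han script ranges for Unicode 6.1, taken from the list above
HAN_RANGES = [
    (0x2E80, 0x2E99), (0x2E9B, 0x2EF3), (0x2F00, 0x2FD5),
    (0x3005, 0x3005), (0x3007, 0x3007), (0x3021, 0x3029),
    (0x3038, 0x303B), (0x3400, 0x4DB5), (0x4E00, 0x9FCC),
    (0xF900, 0xFA6D), (0xFA70, 0xFAD9), (0x20000, 0x2A6D6),
    (0x2A700, 0x2B734), (0x2B740, 0x2B81D), (0x2F800, 0x2FA1D),
]

# build "[start-end...]+"; astral ranges are fine in Python 3
char_class = ''.join('{}-{}'.format(chr(a), chr(b)) for a, b in HAN_RANGES)
HAN_RE = re.compile('[{}]+'.format(char_class))

print(HAN_RE.findall('abc 漢字 def 小'))  # ['漢字', '小']
```

This is frozen to one Unicode version, which is exactly the maintenance burden that a real \p{Han} property avoids.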
EDITED TO ADD
Here’s the program that dumps out the ranges of code points that apply to any given Unicode script.
#!/usr/bin/env perl
#
# uniscrange - given a Unicode script name, print out the ranges of code
# points that apply.
# Tom Christiansen <tchrist@perl.com>

use strict;
use warnings;

use Unicode::UCD qw(charscript);

for my $arg (@ARGV) {
    print "$arg: " if @ARGV > 1;
    dump_range($arg);
}

sub dump_range {
    my($scriptname) = @_;
    my $alist = charscript($scriptname);
    unless ($alist) {
        warn "Unknown script '$scriptname'\n";
        return;
    }
    for my $aref (@$alist) {
        my($start, $stop, $name) = @$aref;
        die "got $name, not $scriptname\n" unless $name eq $scriptname;
        printf "%04X-%04X ", $start, $stop;
    }
    print "\n";
}
Its answers depend on which version of Perl — and thus, which version of Unicode — you’re running it against.
$ perl5.8.8 ~/uniscrange Latin Greek
Latin: 0041-005A 0061-007A 00AA-00AA 00BA-00BA 00C0-00D6 00D8-00F6 00F8-01BA 01BB-01BB 01BC-01BF 01C0-01C3 01C4-0241 0250-02AF 02B0-02B8 02E0-02E4 1D00-1D25 1D2C-1D5C 1D62-1D65 1D6B-1D77 1D79-1D9A 1D9B-1DBF 1E00-1E9B 1EA0-1EF9 2071-2071 207F-207F 2090-2094 212A-212B FB00-FB06 FF21-FF3A FF41-FF5A
Greek: 0374-0375 037A-037A 0384-0385 0386-0386 0388-038A 038C-038C 038E-03A1 03A3-03CE 03D0-03E1 03F0-03F5 03F6-03F6 03F7-03FF 1D26-1D2A 1D5D-1D61 1D66-1D6A 1F00-1F15 1F18-1F1D 1F20-1F45 1F48-1F4D 1F50-1F57 1F59-1F59 1F5B-1F5B 1F5D-1F5D 1F5F-1F7D 1F80-1FB4 1FB6-1FBC 1FBD-1FBD 1FBE-1FBE 1FBF-1FC1 1FC2-1FC4 1FC6-1FCC 1FCD-1FCF 1FD0-1FD3 1FD6-1FDB 1FDD-1FDF 1FE0-1FEC 1FED-1FEF 1FF2-1FF4 1FF6-1FFC 1FFD-1FFE 2126-2126 10140-10174 10175-10178 10179-10189 1018A-1018A 1D200-1D241 1D242-1D244 1D245-1D245
$ perl5.10.0 ~/uniscrange Latin Greek
Latin: 0041-005A 0061-007A 00AA-00AA 00BA-00BA 00C0-00D6 00D8-00F6 00F8-01BA 01BB-01BB 01BC-01BF 01C0-01C3 01C4-0293 0294-0294 0295-02AF 02B0-02B8 02E0-02E4 1D00-1D25 1D2C-1D5C 1D62-1D65 1D6B-1D77 1D79-1D9A 1D9B-1DBE 1E00-1E9B 1EA0-1EF9 2071-2071 207F-207F 2090-2094 212A-212B 2132-2132 214E-214E 2184-2184 2C60-2C6C 2C74-2C77 FB00-FB06 FF21-FF3A FF41-FF5A
Greek: 0374-0375 037A-037A 037B-037D 0384-0385 0386-0386 0388-038A 038C-038C 038E-03A1 03A3-03CE 03D0-03E1 03F0-03F5 03F6-03F6 03F7-03FF 1D26-1D2A 1D5D-1D61 1D66-1D6A 1DBF-1DBF 1F00-1F15 1F18-1F1D 1F20-1F45 1F48-1F4D 1F50-1F57 1F59-1F59 1F5B-1F5B 1F5D-1F5D 1F5F-1F7D 1F80-1FB4 1FB6-1FBC 1FBD-1FBD 1FBE-1FBE 1FBF-1FC1 1FC2-1FC4 1FC6-1FCC 1FCD-1FCF 1FD0-1FD3 1FD6-1FDB 1FDD-1FDF 1FE0-1FEC 1FED-1FEF 1FF2-1FF4 1FF6-1FFC 1FFD-1FFE 2126-2126 10140-10174 10175-10178 10179-10189 1018A-1018A 1D200-1D241 1D242-1D244 1D245-1D245
You can use the corelist -a Unicode command to see which version of Unicode goes with which version of Perl. Here is selected output:
$ corelist -a Unicode
v5.8.8 4.1.0
v5.10.0 5.0.0
v5.12.2 5.2.0
v5.14.0 6.0.0
v5.16.0 6.1.0
I don't understand the "same high byte problem" but it seems like it does not apply (at least not for me, VIM 7.4) when you actually enter the character to build up the ranges.
I usually search from U+3400(㐀) to U+9FCC(鿌) to capture Chinese characters in Japanese texts.
U+3400(㐀) is beginning of "CJK Unified Ideographs Extension A"
U+4DC0 - U+4DFF "Yijing Hexagram Symbols" is in between but not excluded for the sake of simplicity.
U+9FCC(鿌) is the end of "CJK Unified Ideographs"
Please note that the Japanese writing uses "々" as a kanji repetition symbol which is not part of this block. You can find it in the Block "Japanese Symbols and Punctuation."
/[㐀-鿌]
An (almost?) complete set of Chinese characters with extensions
/[㐀-鿌豈-龎𠀀-𪘀]
This range includes:
CJK Unified Ideographs Extension A
Yijing Hexagram Symbols (shouldn't be part of it)
CJK Unified Ideographs (main part)
CJK Compatibility Ideographs
CJK Unified Ideographs Extension B,
CJK Unified Ideographs Extension C,
CJK Unified Ideographs Extension D,
CJK Compatibility Ideographs Supplement
Bonus for people working on content in Japanese language:
Hiragana goes from U+3041 to U+3096
/[ぁ-ゟ]
Katakana
/[゠-ヿ]
Kanji Radicals
/[⺀-⿕]
Japanese Symbols and Punctuation.
Note that this range also includes 々(repetition of last kanji) and 〆(abbreviation for shime「しめ」). You might want to add them to your range to find words.
[ -〿]
Miscellaneous Japanese Symbols and Characters
/[ㇰ-ㇿ㈠-㉃㊀-㍿]
Alphanumeric and Punctuation (Full Width)
[!-~]
sources:
http://www.fileformat.info/info/unicode/char/9fcc/index.htm
http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/comment-page-1/#comment-46891
In some simple cases, I use this to search for Chinese characters. It also matches Japanese, Russian and any other characters outside Latin-1.
[^\x00-\xff]
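The same class behaves identically in Python, which makes the overmatching easy to see (sample string made up for illustration):

```python
import re

# [^\x00-\xff] means "any character above U+00FF", so Cyrillic (and much else)
# is caught right along with the CJK characters
print(re.findall('[^\x00-\xff]+', 'abc 漢字 привет'))  # ['漢字', 'привет']
```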