How to search all CJK chars in Vim?

I can search a CJK char (such as 小) by using a unicode code point:
/\%u5c0f
/[\u5c0f]
I cannot search all CJK chars by using [\u4E00-\u9FFF], because the Vim manual says:
:help /[]
NOTE: The other backslash codes mentioned above do not work inside []!
Is there a way to do the job?

It seems that Vim ranges are somehow limited to the same high byte, because /[\u4E00-\u4eFF] works fine. If you don't mind the mess, try:
/[\u4e00-\u4eff\u4f00-\u4fff\u5000-\u50ff\u5100-\u51ff\u5200-\u52ff\u5300-\u53ff\u5400-\u54ff\u5500-\u55ff\u5600-\u56ff\u5700-\u57ff\u5800-\u58ff\u5900-\u59ff\u5a00-\u5aff\u5b00-\u5bff\u5c00-\u5cff\u5d00-\u5dff\u5e00-\u5eff\u5f00-\u5fff\u6000-\u60ff\u6100-\u61ff\u6200-\u62ff\u6300-\u63ff\u6400-\u64ff\u6500-\u65ff\u6600-\u66ff\u6700-\u67ff\u6800-\u68ff\u6900-\u69ff\u6a00-\u6aff\u6b00-\u6bff\u6c00-\u6cff\u6d00-\u6dff\u6e00-\u6eff\u6f00-\u6fff\u7000-\u70ff\u7100-\u71ff\u7200-\u72ff\u7300-\u73ff\u7400-\u74ff\u7500-\u75ff\u7600-\u76ff\u7700-\u77ff\u7800-\u78ff\u7900-\u79ff\u7a00-\u7aff\u7b00-\u7bff\u7c00-\u7cff\u7d00-\u7dff\u7e00-\u7eff\u7f00-\u7fff\u8000-\u80ff\u8100-\u81ff\u8200-\u82ff\u8300-\u83ff\u8400-\u84ff\u8500-\u85ff\u8600-\u86ff\u8700-\u87ff\u8800-\u88ff\u8900-\u89ff\u8a00-\u8aff\u8b00-\u8bff\u8c00-\u8cff\u8d00-\u8dff\u8e00-\u8eff\u8f00-\u8fff\u9000-\u90ff\u9100-\u91ff\u9200-\u92ff\u9300-\u93ff\u9400-\u94ff\u9500-\u95ff\u9600-\u96ff\u9700-\u97ff\u9800-\u98ff\u9900-\u99ff\u9a00-\u9aff\u9b00-\u9bff\u9c00-\u9cff\u9d00-\u9dff\u9e00-\u9eff\u9f00-\u9fff]
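
If you would rather not type that by hand, the chunked pattern is easy to generate. Here is a minimal Python 3 sketch (my own helper, not something Vim ships) that prints the search command above:

start, stop = 0x4E00, 0x9FFF
chunks = []
lo = start
while lo <= stop:
    # keep each range within one 256-codepoint "page" so the high byte never changes
    hi = min(lo | 0xFF, stop)
    chunks.append(r"\u%04x-\u%04x" % (lo, hi))
    lo = hi + 1
print("/[" + "".join(chunks) + "]")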

I played around with this quite a bit, and in Vim the following seems to find all the Kanji characters in my Kanji/Pinyin/English text:
[^!-~0-9 aāáǎăàeēéěèiīíǐĭìoōóǒŏòuūúǔùǖǘǚǜ]

Vim cannot actually do this by itself, since you aren’t given access to Unicode properties like \p{Han}.
As of Unicode v6.0, the range of codepoints for characters in the Han script is:
2E80-2E99 2E9B-2EF3 2F00-2FD5 3005-3005 3007-3007 3021-3029 3038-303B 3400-4DB5 4E00-9FCB F900-FA2D FA30-FA6D FA70-FAD9 20000-2A6D6 2A700-2B734 2B740-2B81D 2F800-2FA1D
Whereas with Unicode v6.1, the range of Han codepoints has changed to:
2E80-2E99 2E9B-2EF3 2F00-2FD5 3005-3005 3007-3007 3021-3029 3038-303B 3400-4DB5 4E00-9FCC F900-FA6D FA70-FAD9 20000-2A6D6 2A700-2B734 2B740-2B81D 2F800-2FA1D
I also seem to recall that Vim has difficulties expressing astral code points, which are needed for this to work correctly. For example, using the flexible \x{HHHHHH} notation from Java 7 or Perl, you would have:
[\x{2E80}-\x{2E99}\x{2E9B}-\x{2EF3}\x{2F00}-\x{2FD5}\x{3005}-\x{3005}\x{3007}-\x{3007}\x{3021}-\x{3029}\x{3038}-\x{303B}\x{3400}-\x{4DB5}\x{4E00}-\x{9FCC}\x{F900}-\x{FA6D}\x{FA70}-\x{FAD9}\x{20000}-\x{2A6D6}\x{2A700}-\x{2B734}\x{2B740}-\x{2B81D}\x{2F800}-\x{2FA1D}]
Notice that the last part of the range is \x{2F800}-\x{2FA1D}, which is beyond the BMP. But what you really need is \p{Han} (meaning, \p{Script=Han}). This again shows that regex dialects that don’t support at least Level 1 of UTS#18: Basic Unicode Support are inadequate for working with Unicode. Vim’s regexes are inadequate for basic Unicode work.
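For contrast, an engine that does implement script properties makes this a one-liner. Here is a minimal sketch using the third-party Python regex module (an alternative tool, not a Vim feature), which understands \p{Han}:

import regex  # third-party module: pip install regex

text = "Vim 可以 search 漢字 and ASCII."
print(regex.findall(r"\p{Han}+", text))   # ['可以', '漢字']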
EDITED TO ADD
Here’s the program that dumps out the ranges of code points that apply to any given Unicode script.
#!/usr/bin/env perl
#
# uniscrange - given a Unicode script name, print out the ranges of code
#              points that apply.
# Tom Christiansen <tchrist@perl.com>

use strict;
use warnings;
use Unicode::UCD qw(charscript);

for my $arg (@ARGV) {
    print "$arg: " if @ARGV > 1;
    dump_range($arg);
}

sub dump_range {
    my($scriptname) = @_;
    my $alist = charscript($scriptname);
    unless ($alist) {
        warn "Unknown script '$scriptname'\n";
        return;
    }
    for my $aref (@$alist) {
        my($start, $stop, $name) = @$aref;
        die "got $name, not $scriptname\n" unless $name eq $scriptname;
        printf "%04X-%04X ", $start, $stop;
    }
    print "\n";
}
Its answers depend on which version of Perl — and thus, which version of Unicode — you’re running it against.
$ perl5.8.8 ~/uniscrange Latin Greek
Latin: 0041-005A 0061-007A 00AA-00AA 00BA-00BA 00C0-00D6 00D8-00F6 00F8-01BA 01BB-01BB 01BC-01BF 01C0-01C3 01C4-0241 0250-02AF 02B0-02B8 02E0-02E4 1D00-1D25 1D2C-1D5C 1D62-1D65 1D6B-1D77 1D79-1D9A 1D9B-1DBF 1E00-1E9B 1EA0-1EF9 2071-2071 207F-207F 2090-2094 212A-212B FB00-FB06 FF21-FF3A FF41-FF5A
Greek: 0374-0375 037A-037A 0384-0385 0386-0386 0388-038A 038C-038C 038E-03A1 03A3-03CE 03D0-03E1 03F0-03F5 03F6-03F6 03F7-03FF 1D26-1D2A 1D5D-1D61 1D66-1D6A 1F00-1F15 1F18-1F1D 1F20-1F45 1F48-1F4D 1F50-1F57 1F59-1F59 1F5B-1F5B 1F5D-1F5D 1F5F-1F7D 1F80-1FB4 1FB6-1FBC 1FBD-1FBD 1FBE-1FBE 1FBF-1FC1 1FC2-1FC4 1FC6-1FCC 1FCD-1FCF 1FD0-1FD3 1FD6-1FDB 1FDD-1FDF 1FE0-1FEC 1FED-1FEF 1FF2-1FF4 1FF6-1FFC 1FFD-1FFE 2126-2126 10140-10174 10175-10178 10179-10189 1018A-1018A 1D200-1D241 1D242-1D244 1D245-1D245
$ perl5.10.0 ~/uniscrange Latin Greek
Latin: 0041-005A 0061-007A 00AA-00AA 00BA-00BA 00C0-00D6 00D8-00F6 00F8-01BA 01BB-01BB 01BC-01BF 01C0-01C3 01C4-0293 0294-0294 0295-02AF 02B0-02B8 02E0-02E4 1D00-1D25 1D2C-1D5C 1D62-1D65 1D6B-1D77 1D79-1D9A 1D9B-1DBE 1E00-1E9B 1EA0-1EF9 2071-2071 207F-207F 2090-2094 212A-212B 2132-2132 214E-214E 2184-2184 2C60-2C6C 2C74-2C77 FB00-FB06 FF21-FF3A FF41-FF5A
Greek: 0374-0375 037A-037A 037B-037D 0384-0385 0386-0386 0388-038A 038C-038C 038E-03A1 03A3-03CE 03D0-03E1 03F0-03F5 03F6-03F6 03F7-03FF 1D26-1D2A 1D5D-1D61 1D66-1D6A 1DBF-1DBF 1F00-1F15 1F18-1F1D 1F20-1F45 1F48-1F4D 1F50-1F57 1F59-1F59 1F5B-1F5B 1F5D-1F5D 1F5F-1F7D 1F80-1FB4 1FB6-1FBC 1FBD-1FBD 1FBE-1FBE 1FBF-1FC1 1FC2-1FC4 1FC6-1FCC 1FCD-1FCF 1FD0-1FD3 1FD6-1FDB 1FDD-1FDF 1FE0-1FEC 1FED-1FEF 1FF2-1FF4 1FF6-1FFC 1FFD-1FFE 2126-2126 10140-10174 10175-10178 10179-10189 1018A-1018A 1D200-1D241 1D242-1D244 1D245-1D245
You can use the corelist -a Unicode command to see which version of Unicode goes with which version of Perl. Here is selected output:
$ corelist -a Unicode
v5.8.8 4.1.0
v5.10.0 5.0.0
v5.12.2 5.2.0
v5.14.0 6.0.0
v5.16.0 6.1.0
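Python ships a Unicode database too, and you can check its bundled version in the same spirit (a side note of mine, not from the answer):

import unicodedata
print(unicodedata.unidata_version)   # e.g. '13.0.0' on Python 3.9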

I don't understand the "same high byte" problem, but it seems like it does not apply (at least not for me, on Vim 7.4) when you actually enter the characters themselves to build up the ranges.
I usually search from U+3400(㐀) to U+9FCC(鿌) to capture Chinese characters in Japanese texts.
U+3400(㐀) is the beginning of "CJK Unified Ideographs Extension A"
U+4DC0 - U+4DFF "Yijing Hexagram Symbols" is in between but not excluded, for the sake of simplicity.
U+9FCC(鿌) is the end of "CJK Unified Ideographs"
Please note that Japanese writing uses "々" as a kanji repetition symbol, which is not part of this block. You can find it in the block "Japanese Symbols and Punctuation."
/[㐀-鿌]
An (almost?) complete set of Chinese characters, with extensions:
/[㐀-鿌豈-龎𠀀-𪘀]
This range includes:
CJK Unified Ideographs Extension A
Yijing Hexagram Symbols (shouldn't be part of it)
CJK Unified Ideographs (main part)
CJK Compatibility Ideographs
CJK Unified Ideographs Extension B,
CJK Unified Ideographs Extension C,
CJK Unified Ideographs Extension D,
CJK Compatibility Ideographs Supplement
Bonus for people working on content in the Japanese language (a combined sketch follows the list of ranges below):
Hiragana goes from U+3041 to U+3096
/[ぁ-ゟ]
Katakana
/[゠-ヿ]
Kanji Radicals
/[⺀-⿕]
Japanese Symbols and Punctuation.
Note that this range also includes 々(repetition of last kanji) and 〆(abbreviation for shime「しめ」). You might want to add them to your range to find words.
[ -〿]
Miscellaneous Japanese Symbols and Characters
/[ㇰ-ㇿ㈠-㉃㊀-㍿]
Alphanumeric and Punctuation (Full Width)
[!-~]
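To tie the ranges above together, here is a small Python 3 sketch of mine (not part of the original answer) that pulls Japanese runs out of mixed text, using the Hiragana, Katakana, and main CJK blocks listed above plus 々 and 〆:

import re

japanese = re.compile("[\u3041-\u3096\u30A0-\u30FF\u4E00-\u9FCC\u3005\u3006]+")
print(japanese.findall("Vimで日本語のテキストを検索する example"))
# ['で日本語のテキストを検索する']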
sources:
http://www.fileformat.info/info/unicode/char/9fcc/index.htm
http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/comment-page-1/#comment-46891

In some simple cases, I use this to search for Chinese characters. It also matches Japanese, Russian, and other non-Latin-1 characters.
[^\x00-\xff]
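The same trick outside Vim, as a quick Python 3 illustration of mine: everything above U+00FF (i.e. outside Latin-1) matches, which is why Japanese and Russian are caught while accented Western European letters are not:

import re

non_latin1 = re.compile("[^\x00-\xff]+")
print(non_latin1.findall("abc 小さい Владивосток café"))
# ['小さい', 'Владивосток'] -- 'é' is U+00E9, inside Latin-1, so 'café' is left alone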

Related

Use regular expression to match characters appearing in Traditional Chinese ONLY

\p{Han} can be used to match all Chinese characters (in the Han script), which mix both Simplified Chinese and Traditional Chinese.
Is there a regular expression to match only the characters unique in Traditional Chinese? In other words, match any Chinese characters except for the ones in Simplified Chinese. Things like (?!\p{Hans})\p{Hant}.
Furthermore, ideally, if the regular expression can also exclude Japanese Kanji, Korean Hanja, Vietnamese Chữ Nho and Chữ Nôm.
Because Traditional Chinese characters are not contiguous in the Unicode table, there is unfortunately no simple regex short of testing them one by one, unless things like \p{Hant} and \p{Hans} are supported by the regex engine.
Inspired by the answer^ pointed to by @jdaz's comment, I wrote a Python script using the hanzidentifier module to generate the regex that matches the characters unique to Traditional Chinese&:
from typing import List, Tuple

from hanzidentifier import identify, TRADITIONAL


def main():
    block = [
        *range(0x4E00, 0x9FFF + 1),    # CJK Unified Ideographs
        *range(0x3400, 0x4DBF + 1),    # CJK Unified Ideographs Extension A
        *range(0x20000, 0x2A6DF + 1),  # CJK Unified Ideographs Extension B
        *range(0x2A700, 0x2B73F + 1),  # CJK Unified Ideographs Extension C
        *range(0x2B740, 0x2B81F + 1),  # CJK Unified Ideographs Extension D
        *range(0x2B820, 0x2CEAF + 1),  # CJK Unified Ideographs Extension E
        *range(0x2CEB0, 0x2EBEF + 1),  # CJK Unified Ideographs Extension F
        *range(0x30000, 0x3134F + 1),  # CJK Unified Ideographs Extension G
        *range(0xF900, 0xFAFF + 1),    # CJK Compatibility Ideographs
        *range(0x2F800, 0x2FA1F + 1),  # CJK Compatibility Ideographs Supplement
    ]
    block.sort()

    result: List[Tuple[int, int]] = []
    for point in block:
        char = chr(point)
        identify_result = identify(char)
        if identify_result is TRADITIONAL:
            # is traditional only, save into the result list
            if len(result) > 0 and result[-1][1] + 1 == point:
                # the current char is right after the last char, just update the range
                result[-1] = (result[-1][0], point)
            else:
                result.append((point, point))

    range_regexes: List[str] = []
    # now we have a list of ranges, convert them into a regex
    for start, end in result:
        if start == end:
            range_regexes.append(chr(start))
        elif start + 1 == end:
            range_regexes.append(chr(start))
            range_regexes.append(chr(end))
        else:
            range_regexes.append(f'{chr(start)}-{chr(end)}')

    # join them together and wrap into [] to form a regex set
    regex_char_set = ''.join(range_regexes)
    print(f'[{regex_char_set}]')


if __name__ == '__main__':
    main()
This generates the Regex which I've posted here: https://regex101.com/r/FkkHQ1/5 (seems like Stack Overflow doesn't like me posting the generated Regex)
Note that because hanzidentifier uses CC-CEDICT, and not even the latest version of it, some Traditional characters are certainly not covered, but it should be enough for the commonly used characters.
Japanese Kanji is a large set. Luckily, the Japanese Agency for Cultural Affairs has a list of commonly used kanji, so I created this text file for the program to read. After excluding the commonly used kanji, I got this regex: https://regex101.com/r/FkkHQ1/7
Unfortunately, I couldn't find a list of commonly used Korean Hanja; in any case, Hanja are rarely used nowadays. Vietnamese Chữ Nho and Chữ Nôm have almost been wiped out as well.
Footnote:
^: the Regex in that answer doesn't match all Simplified characters. To get a Regex that matches all Simplified characters (including the ones in Traditional Chinese as well), change if identify_result is TRADITIONAL to if identify_result is SIMPLIFIED or identify_result is BOTH, which gives us the Regex: https://regex101.com/r/FkkHQ1/6
&: this script doesn't filter Japanese Kanji, Korean Hanja, Vietnamese Chữ Nho or Chữ Nôm. You have to modify it to exclude them.

python re not handling hangul

I'm trying to extract Hangul, English, and numbers from string input.
hangul = re.compile('[^a-zA-Z0-9\u3131-\u3163\uac00-\ud7a3]+')
s = u'abcd 가나다라 1234'
print hangul.sub('', s)
this gives me u'abcd1234'
Why does it ignore \uac00-\ud7a3?
I'm the developer for python jamo. If you use Python 3,
then you can use functions such as jamo.is_hangul_char. Otherwise, you could use the source code to help you out (you're missing a few Korean characters in your regex).
If you don't want to miss out on some of the older Hangul jamo display characters, then you want to use \u3131-\u3163\u3165-\u318E to match all Hangul compatibility jamo. If you're only concerned about modern display characters, then you would use \u3131-\u314E\u314F-\u3163 to match all modern Hangul compatibility jamo.
Use a Unicode string in the re.compile; otherwise, \u3163 is not treated as a Unicode escape.
Although not required, '' in the .sub should be Unicode as well. There is an implicit conversion to Unicode in Python 2 otherwise and Python 3 requires it.
#coding:utf8
import re
hangul = re.compile(u'[^a-zA-Z0-9\u3131-\u3163\uac00-\ud7a3]+')
s = u'abcd 가나다라 1234'
print hangul.sub(u'', s)
Output:
abcd가나다라1234
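On Python 3 the same filter needs no u prefixes, since string literals are Unicode by default; a quick sketch of mine for comparison:

import re

hangul = re.compile('[^a-zA-Z0-9\u3131-\u3163\uac00-\ud7a3]+')
s = 'abcd 가나다라 1234'
print(hangul.sub('', s))   # abcd가나다라1234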

Removal of Diacritics using regex in JAVA [duplicate]

The problem is that, as you know, there are thousands of characters in the Unicode chart, and I want to convert all the similar-looking characters to the letters of the English alphabet.
For instance here are a few conversions:
ҥ->H
Ѷ->V
Ȳ->Y
Ǭ->O
Ƈ->C
tђє Ŧค๓เℓy --> the Family
...
I saw that there are more than 20 versions of the letter A/a, and I don't know how to classify them. They look like needles in a haystack.
The complete list of unicode chars is at http://www.ssec.wisc.edu/~tomw/java/unicode.html or http://unicode.org/charts/charindex.html . Just try scrolling down and see the variations of letters.
How can I convert all these with Java? Please help me :(
Reposting my post from How do I remove diacritics (accents) from a string in .NET?
This method works fine in java (purely for the purpose of removing diacritical marks aka accents).
It basically converts all accented characters into their deAccented counterparts followed by their combining diacritics. Now you can use a regex to strip off the diacritics.
import java.text.Normalizer;
import java.util.regex.Pattern;
public String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}
It's a part of Apache Commons Lang as of ver. 3.0.
org.apache.commons.lang3.StringUtils.stripAccents("Añ");
returns An
Also see http://www.drillio.com/en/software-development/java/removing-accents-diacritics-in-any-language/
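The decompose-then-strip idea is not tied to Java; here is a rough Python equivalent of the same technique (my own sketch, not the Commons Lang code):

import unicodedata

def de_accent(text):
    # Decompose to NFD, then drop combining marks (general category Mn)
    nfd = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in nfd if unicodedata.category(ch) != 'Mn')

print(de_accent('Añ'))   # An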
Attempting to "convert them all" is the wrong approach to the problem.
Firstly, you need to understand the limitations of what you are trying to do. As others have pointed out, diacritics are there for a reason: they are essentially unique letters in the alphabet of that language with their own meaning / sound etc.; removing those marks is just the same as replacing random letters in an English word. This is before you even go on to consider the Cyrillic languages and other script-based texts such as Arabic, which simply cannot be "converted" to English.
If you must, for whatever reason, convert characters, then the only sensible way to approach this it to firstly reduce the scope of the task at hand. Consider the source of the input - if you are coding an application for "the Western world" (to use as good a phrase as any), it would be unlikely that you would ever need to parse Arabic characters. Similarly, the Unicode character set contains hundreds of mathematical and pictorial symbols: there is no (easy) way for users to directly enter these, so you can assume they can be ignored.
By taking these logical steps you can reduce the number of possible characters to parse to the point where a dictionary based lookup / replace operation is feasible. It then becomes a small amount of slightly boring work creating the dictionaries, and a trivial task to perform the replacement. If your language supports native Unicode characters (as Java does) and optimises static structures correctly, such find and replaces tend to be blindingly quick.
This comes from experience of having worked on an application that was required to allow end users to search bibliographic data that included diacritic characters. The lookup arrays (as it was in our case) took perhaps 1 man day to produce, to cover all diacritic marks for all Western European languages.
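As a sketch of that dictionary-based approach (the table below is a tiny illustrative fragment, not a finished mapping), the per-character lookup can be done in one pass with Python's str.translate:

# Tiny illustrative fragment of a hand-built lookup table; a real table would
# have to cover every character your input can actually contain.
LOOKUP = str.maketrans({
    'ß': 'ss',             # German sharp s
    'Æ': 'AE', 'æ': 'ae',  # ligatures
    'Ø': 'O',  'ø': 'o',   # letters that NFD cannot decompose
    'Đ': 'D',  'đ': 'd',
})

print("Straße, Æsop, Đorđe".translate(LOOKUP))   # Strasse, AEsop, Dorde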
Since the encoding that turns "the Family" into "tђє Ŧค๓เℓy" is effectively random and not following any algorithm that can be explained by the information of the Unicode codepoints involved, there's no general way to solve this algorithmically.
You will need to build the mapping of Unicode characters into latin characters which they resemble. You could probably do this with some smart machine learning on the actual glyphs representing the Unicode codepoints. But I think the effort for this would be greater than manually building that mapping. Especially if you have a good amount of examples from which you can build your mapping.
To clarify: a few of the substitutions can actually be solved via the Unicode data (as the other answers demonstrate), but some letters simply have no reasonable association with the latin characters which they resemble.
Examples:
"ђ" (U+0452 CYRILLIC SMALL LETTER DJE) is more related to "d" than to "h", but is used to represent "h".
"Ŧ" (U+0166 LATIN CAPITAL LETTER T WITH STROKE) is somewhat related to "T" (as the name suggests) but is used to represent "F".
"ค" (U+0E04 THAI CHARACTER KHO KHWAI) is not related to any latin character at all and in your example is used to represent "a"
String tested : ÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝß
Tested :
Output from Apache Commons Lang3 : AAAAAÆCEEEEIIIIÐNOOOOOØUUUUYß
Output from ICU4j : AAAAAÆCEEEEIIIIÐNOOOOOØUUUUYß
Output from JUnidecode : AAAAAAECEEEEIIIIDNOOOOOOUUUUUss (problem with Ý and another issue)
Output from Unidecode : AAAAAAECEEEEIIIIDNOOOOOOUUUUYss
The last choice is the best.
The original request has been answered already.
However, I am posting the below answer for those who might be looking for generic transliteration code to transliterate any charset to Latin/English in Java.
Naive meaning of transliteration:
The translated string, in its final form/target charset, sounds like the string in its original form.
If we want to transliterate any charset to Latin (the English alphabet), then ICU4J (the ICU4J library in Java) will do the job.
Here is the code snippet in java:
import com.ibm.icu.text.Transliterator; // ICU4J library import

public static String TRANSLITERATE_ID = "NFD; Any-Latin; NFC";
public static String NORMALIZE_ID = "NFD; [:Nonspacing Mark:] Remove; NFC";

/**
 * Returns the transliterated string to convert any charset to latin.
 */
public static String transliterate(String input) {
    Transliterator transliterator = Transliterator.getInstance(TRANSLITERATE_ID + "; " + NORMALIZE_ID);
    String result = transliterator.transliterate(input);
    return result;
}
If the need is to convert "òéışöç" -> "oeisoc", you can use this as a starting point:
public class AsciiUtils {
    private static final String PLAIN_ASCII =
        "AaEeIiOoUu"    // grave
      + "AaEeIiOoUuYy"  // acute
      + "AaEeIiOoUuYy"  // circumflex
      + "AaOoNn"        // tilde
      + "AaEeIiOoUuYy"  // umlaut
      + "Aa"            // ring
      + "Cc"            // cedilla
      + "OoUu"          // double acute
      ;

    private static final String UNICODE =
        "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"
      + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD"
      + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177"
      + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
      + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF"
      + "\u00C5\u00E5"
      + "\u00C7\u00E7"
      + "\u0150\u0151\u0170\u0171"
      ;

    // private constructor, can't be instantiated!
    private AsciiUtils() { }

    // remove accented characters from a string and replace with ascii equivalents
    public static String convertNonAscii(String s) {
        if (s == null) return null;
        StringBuilder sb = new StringBuilder();
        int n = s.length();
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            int pos = UNICODE.indexOf(c);
            if (pos > -1) {
                sb.append(PLAIN_ASCII.charAt(pos));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String args[]) {
        String s = "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
        System.out.println(AsciiUtils.convertNonAscii(s));
        // output :
        // The result : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o,c
    }
}
The JDK 1.6 provides the java.text.Normalizer class that can be used for this task.
See an example here
The problem with "converting" arbitrary Unicode to ASCII is that the meaning of a character is culture-dependent. For example, “ß” to a German-speaking person should be converted to "ss" while an English-speaker would probably convert it to “B”.
Add to that the fact that Unicode has multiple code points for the same glyphs.
The upshot is that the only way to do this is create a massive table with each Unicode character and the ASCII character you want to convert it to. You can take a shortcut by normalizing characters with accents to normalization form KD, but not all characters normalize to ASCII. In addition, Unicode does not define which parts of a glyph are "accents".
Here is a tiny excerpt from an app that does this:
switch (c)
{
    case 'A':
    case '\u00C0': // À LATIN CAPITAL LETTER A WITH GRAVE
    case '\u00C1': // Á LATIN CAPITAL LETTER A WITH ACUTE
    case '\u00C2': // Â LATIN CAPITAL LETTER A WITH CIRCUMFLEX
        // and so on for about 20 lines...
        return "A";
    case '\u00C6': // Æ LATIN CAPITAL LIGATURE AE
        return "AE";
    // And so on for pages...
}
You could try using unidecode, which is available as a Ruby gem and as a Perl module on CPAN. Essentially, it works as a huge lookup table, where each Unicode code point maps to an ASCII character or string.
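There is also a Python port of the same idea; assuming the third-party unidecode package is installed, transliteration is a one-liner:

from unidecode import unidecode  # third-party: pip install unidecode

print(unidecode("déjà vu, Владивосток"))   # deja vu, Vladivostok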
There is no easy or general way to do what you want because it is just your subjective opinion that these letters look like the latin letters you want to convert to. They are actually separate letters with their own distinct names and sounds which just happen to superficially look like a latin letter.
If you want that conversion, you have to create your own translation table based on what latin letters you think the non-latin letters should be converted to.
(If you only want to remove diacritical marks, there are some answers in this thread: How do I remove diacritics (accents) from a string in .NET? However, you describe a more general problem.)
I'm late to the party, but after facing this issue today, I found this answer to be very good:
String asciiName = Normalizer.normalize(unicodeName, Normalizer.Form.NFD)
.replaceAll("[^\\p{ASCII}]", "");
Reference:
https://stackoverflow.com/a/16283863
The following class does the trick:
org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter

How can I specify Cyrillic character ranges in a Python 3.2 regex?

Once upon a time, I found this question interesting.
Today I decided to play around with the text of that book.
I want to use the regular expression in this script. When I use the script on Cyrillic text, it wipes out all of the Cyrillic characters, leaving only punctuation and whitespace.
#!/usr/bin/env python3.2
# coding=UTF-8
import sys, re
for file in sys.argv[1:]:
    f = open(file)
    fs = f.read()
    regexnl = re.compile('[^\s\w.,?!:;-]')
    rstuff = regexnl.sub('', fs)
    f.close()
    print(rstuff)
Something very similar has already been done in this answer.
Basically, I just want to be able to specify a set of characters that are not alphabetic, alphanumeric, or punctuation or whitespace.
This doesn't exactly answer your question, but the regex module has much much better unicode support than the built-in re module. e.g. regex supports the \p{Cyrillic} property and its negation \P{Cyrillic} (as well as a huge number of other unicode properties). Also, it handles unicode case-insensitivity correctly.
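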
You can specify the unicode range pretty easily: \u0400-\u0500. See also here.
Here's an example with some text from the Russian wikipedia, and also a sentence from the English wikipedia containing a single word in cyrillic.
#coding=utf-8
import re
ru = u"Владивосток находится на одной широте с Сочи, однако имеет среднегодовую температуру почти на 10 градусов ниже."
en = u"Vladivostok (Russian: Владивосток; IPA: [vlədʲɪvɐˈstok] ( listen); Chinese: 海參崴; pinyin: Hǎishēnwǎi) is a city and the administrative center of Primorsky Krai, Russia"
cyril1 = re.findall(u"[\u0400-\u0500]+", en)
cyril2 = re.findall(u"[\u0400-\u0500]+", ru)
for x in cyril1:
    print x
for x in cyril2:
    print x
output:
Владивосток
------
Владивосток
находится
на
одной
широте
с
Сочи
однако
имеет
среднегодовую
температуру
почти
на
градусов
ниже
Addition:
Two other ways that should also work, and in a bit less hackish fashion than specifying a unicode range:
re.findall("(?u)\w+", text) should match Cyrillic as well as Latin word characters.
re.findall("\w+", text, re.UNICODE) is equivalent
So, more specifically for your problem:
re.compile('[^\s\w.,?!:;-]', re.UNICODE) should do the trick.
See here (point 7)
For practical reasons I suggest using the exact Modern Russian subset of glyphs, instead of general Cyrillic. This is because Russian websites never use the full Cyrillic subset, which includes Belarusian, Ukrainian, Slavonic and Macedonian glyphs. For historical reasons I am keeping "\u0463".
//Basic Cyr Unicode range for use on Russian websites.
0401,0406,0410,0411,0412,0413,0414,0415,0416,0417,0418,0419,041A,041B,041C,041D,041E,041F,0420,0421,0422,0423,0424,0425,0426,0427,0428,0429,042A,042B,042C,042D,042E,042F,0430,0431,0432,0433,0434,0435,0436,0437,0438,0439,043A,043B,043C,043D,043E,043F,0440,0441,0442,0443,0444,0445,0446,0447,0448,0449,044A,044B,044C,044D,044E,044F,0451,0462,0463
Using this subset on a multilingual website will save you 60% of bandwidth, in comparison to using the original full range, and will increase page loading speed accordingly.
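If you want to paste that subset into a pattern, a small Python 3 helper of mine (not part of the answer) can rebuild it as a character class; the code points above are Ё (0401), І (0406), the contiguous block 0410-044F, ё (0451), and Ѣ/ѣ (0462/0463):

# Rebuild the subset listed above as a regex character class.
codepoints = [0x0401, 0x0406] + list(range(0x0410, 0x0450)) + [0x0451, 0x0462, 0x0463]
char_class = "[" + "".join(chr(cp) for cp in codepoints) + "]"
print(char_class)   # paste into your pattern; none of these characters need escaping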

Regular Expression To Anglicize String Characters?

Is there a common regular expression that replaces all known special characters in non-English languages:
é, ô, ç, etc.
with English characters:
e, o, c, etc.
¡⅁uoɹʍ puɐ ⅂IɅƎ
This cannot be done, and you should not want to do it! It’s offensive to the whole world, and it’s naïve to the point of ignorance to believe that façade rhymes with arcade, or that Cañon City, Colorado falls under canon law.
You could run the string through Unicode Normalization Form D and discard mark characters, but I am certainly not going to tell you how because it is evil and wrong. It is evil for reasons already outlined, and it is wrong because there are a zillion cases it doesn't address at all.
Study Material
Here are what you need to read up on:
Unicode Normalization Forms - UAX #15 This annex describes normalization forms for Unicode text. When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation. This annex also provides examples, additional specifications regarding normalization of Unicode text, and information about conformance testing for Unicode normalization forms.
Canonical Equivalence in Applications - UTN #5 This document describes methods and formats for efficient processing of text under canonical equivalence, as defined in UAX #15 Unicode Normalization Forms [UAX15].
Unicode Collation Algorithm - UTS #10 This report is the specification of the Unicode Collation Algorithm (UCA), which details how to compare two Unicode strings while remaining conformant to the requirements of the Unicode Standard. The UCA also supplies the Default Unicode Collation Element Table (DUCET) as the data specifying the default collation order for all Unicode characters.
You MUST learn how to compare strings in a way that makes sense, and mutilating them simply never makes any sense whatso [pəʇələp] ever.
You must never just compare unnormalized strings code point by code point, and if possible you need to take the language into account, since rules differ between them.
Practical Examples
No matter the programming language you’re using, it may also help you to read the documentation for Perl’s Unicode::Normalize, Unicode::Collate, and Unicode::Collate::Locale modules.
For example, to search for "MÜSS" in a text that has "muß" in it, you would do this:
my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
# (normalization => undef) is REQUIRED.
my $str = "Ich muß studieren Perl.";
my $sub = "MÜSS";
my $match;
if (my($pos,$len) = $Collator->index($str, $sub)) {
    $match = substr($str, $pos, $len);
}
That will put "muß" into $match.
The Unicode::Collate::Locale module has support for tailoring to these locales:
af Afrikaans
ar Arabic
az Azerbaijani (Azeri)
be Belarusian
bg Bulgarian
ca Catalan
cs Czech
cy Welsh
da Danish
de__phonebook German (umlaut as 'ae', 'oe', 'ue')
eo Esperanto
es Spanish
es__traditional Spanish ('ch' and 'll' as a grapheme)
et Estonian
fi Finnish
fil Filipino
fo Faroese
fr French
ha Hausa
haw Hawaiian
hr Croatian
hu Hungarian
hy Armenian
ig Igbo
is Icelandic
ja Japanese [1]
kk Kazakh
kl Kalaallisut
ko Korean [2]
lt Lithuanian
lv Latvian
mk Macedonian
mt Maltese
nb Norwegian Bokmal
nn Norwegian Nynorsk
nso Northern Sotho
om Oromo
pl Polish
ro Romanian
ru Russian
se Northern Sami
sk Slovak
sl Slovenian
sq Albanian
sr Serbian
sv Swedish
sw Swahili
tn Tswana
to Tonga
tr Turkish
uk Ukrainian
vi Vietnamese
wo Wolof
yo Yoruba
zh Chinese
zh__big5han Chinese (ideographs: big5 order)
zh__gb2312han Chinese (ideographs: GB-2312 order)
zh__pinyin Chinese (ideographs: pinyin order)
zh__stroke Chinese (ideographs: stroke order)
You have a choice: you can do this right, or you can not do it at all. No one will thank you if you do it wrong.
Doing it right means taking UAX#15 and UTS#10 into account.
Nothing less is acceptable in this day and age. It’s not the 1960s any more, you know!
No, there is no such regex. Note that with a regex you "describe" a specific piece of text.
A certain regex implementation might provide the possibility to do replacements using a regex, but such a replacement is usually a single substitution: it will not replace a with a', b with b', and so on, each with its own target.
Perhaps the language you're working with has a method in its API to perform this kind of replacements, but it won't be using regex.
This task is what the iconv library is for. Find out how to use it in whichever language you're developing in.
Chances are your language already has a binding for it.
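As one hedged example, here is how calling the iconv command-line tool from Python might look (assuming a glibc/GNU iconv on the PATH that understands the //TRANSLIT suffix; behaviour varies by platform and locale):

import subprocess

text = "déjà vu, naïve façade"
result = subprocess.run(
    ["iconv", "-f", "UTF-8", "-t", "ASCII//TRANSLIT"],
    input=text.encode("utf-8"),
    capture_output=True,
    check=True,
)
print(result.stdout.decode("ascii"))   # roughly: deja vu, naive facade (exact output depends on the iconv build and locale)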