Regular expressions (regex) in Japanese

I am learning about regular expressions (regex) for English, and although some of the concepts seem like they would apply to other languages such as Japanese, I feel as if many others would not. For example, a common use of regex is to check whether a word contains non-alphanumeric characters. I don't see how this technique, as well as others, would work for Japanese: there are not only three writing systems, but kanji are also very complex and span a much greater range than alphanumeric characters do. I would appreciate any information on this topic, as well as areas to look into further, since I have very little knowledge of the subject even though I have taken many Japanese courses. If at all possible, I would like your answers to use Python and Java, as those are the languages I am comfortable with. Thank you for your help.

Python regexes offer limited support for Unicode features. Java is better, particularly Java 7.
Java supports Unicode categories. E.g., \p{L} (and its shorthand, \pL) matches any letter in any language. This includes Japanese ideographic characters.
Java 7 supports Unicode scripts, including the Hiragana, Katakana, Han, and Latin scripts that Japanese text is typically composed of. You can match any character in one of these scripts using \p{Han}, \p{Hiragana}, \p{Katakana}, and \p{Latin}. You can combine them in a character class such as [\p{Han}\p{Hiragana}\p{Katakana}]. You can use an uppercase P (as in, \P{Han}) to match any character except those in the Han script.
Java 7 supports Unicode blocks. Unless you are running your code on Android (where scripts are not available), you should generally avoid blocks, since they are less useful and less accurate than Unicode scripts. There are a variety of blocks related to Japanese text, including \p{InHiragana}, \p{InKatakana}, \p{InCJK_Unified_Ideographs}, \p{InCJK_Symbols_and_Punctuation}, etc.
Both Java and Python can refer to individual code points using \uFFFF, where FFFF is any four-digit hexadecimal number. Java 7 can refer to any Unicode code point, including those beyond the Basic Multilingual Plane, using e.g. \x{10FFFF}. Python regexes don't support 21-bit Unicode, but Python strings do, so you can embed a code point in a regex using e.g. \U0010FFFF (uppercase U followed by eight hex digits).
The Java 7 (?U) or UNICODE_CHARACTER_CLASS flag makes character class shorthands like \w and \d Unicode aware, so they will match Japanese ideographic characters, etc. (but note that \d will still not match kanji for numbers like 一二三四). Python 3 makes shorthand classes Unicode aware by default. In Python 2, shorthand classes are Unicode aware when you use the re.UNICODE or re.U flag.
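Putting those pieces together, here is a minimal Java 7+ sketch (the sample string is just an illustration; note that java.util.regex wants the Is prefix for script names, e.g. \p{IsHan} rather than \p{Han}):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JapaneseRegexDemo {
    public static void main(String[] args) {
        String text = "漢字ひらがなカタカナABC123";

        // Match a run of Han (kanji) characters by Unicode script.
        Matcher kanji = Pattern.compile("\\p{IsHan}+").matcher(text);
        if (kanji.find()) {
            System.out.println(kanji.group());   // 漢字
        }

        // Scripts can be combined in a single character class.
        Matcher kana = Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}]+").matcher(text);
        if (kana.find()) {
            System.out.println(kana.group());    // ひらがなカタカナ
        }

        // (?U) makes \w Unicode-aware, so it matches the Japanese characters too.
        System.out.println(text.matches("(?U)\\w+"));   // true
    }
}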
You're right that not all regex ideas carry over equally well to all scripts. Some things (such as letter casing) just don't make sense with Japanese text.

For Python
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
kanji = u'漢字'
hiragana = u'ひらがな'
katakana = u'カタカナ'
text = kanji + hiragana + katakana
#Match Kanji
regex = u'[\u4E00-\u9FFF]+' # == u'[一-龠々]+'
match = re.search(regex, text, re.U)
print match.group().encode('utf-8') #=> 漢字
#Match Hiragana
regex = u'[\u3040-\u309Fー]+' # == u'[ぁ-んー]+'
match = re.search(regex, text, re.U)
print match.group().encode('utf-8') #=> ひらがな
#Match Katakana
regex = u'[\u30A0-\u30FF]+' # == u'[ァ-ヾ]+'
match = re.search(regex, text, re.U)
print match.group().encode('utf-8') #=>カタカナ

The Java character classes do something like what you are looking for; they are the ones that start with \p in the java.util.regex.Pattern documentation.

In Unicode there are two ways to classify characters from different writing systems. They are
Unicode Script (all characters used in a script, regardless of Unicode code points - may come from different blocks)
Unicode Block (code point ranges used for a specific purpose/script - may span across scripts and scripts may span across blocks)
The differences between these are explained rather more clearly on the official Unicode website.
In terms of matching characters in regular expressions in Java, you can use either classification mechanism since Java 7.
This is the syntax, as indicated in the regular expressions tutorial on the Oracle website:
Script:
either \p{IsHiragana} or \p{script=Hiragana}
Block:
either \p{InHiragana} or \p{block=Hiragana}
Note that in one case it's "Is", in the other it's "In".
The syntax \p{Hiragana} indicated in the accepted answer does not seem to be a valid option. I tried it just in case but can confirm that it did not work for me.
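To make the distinction concrete, here is a short Java 7+ sketch of both spellings (the sample strings are just illustrations); it also shows a character on which the two classifications disagree:

import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class ScriptVsBlockDemo {
    public static void main(String[] args) {
        // Script: the "Is" prefix or the script= keyword.
        System.out.println("ひらがな".matches("\\p{IsHiragana}+"));       // true
        System.out.println("ひらがな".matches("\\p{script=Hiragana}+"));  // true

        // Block: the "In" prefix or the block= keyword.
        System.out.println("ひらがな".matches("\\p{InHiragana}+"));       // true
        System.out.println("ひらがな".matches("\\p{block=Hiragana}+"));   // true

        // The two classifications can disagree: the prolonged sound mark U+30FC (ー)
        // sits in the Katakana block, but its script is Common.
        System.out.println("ー".matches("\\p{InKatakana}"));              // true
        System.out.println("ー".matches("\\p{IsKatakana}"));              // false

        // A bare \p{Hiragana} is rejected by java.util.regex.
        try {
            Pattern.compile("\\p{Hiragana}+");
        } catch (PatternSyntaxException e) {
            System.out.println(e.getMessage());  // Unknown character property name ...
        }
    }
}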

Related

re.escape() equivalent in Julia?

I have a bunch of abbreviations I'd like to use in RegEx matches, but they contain lots of regex reserved characters (like . ? $).
In Python you're able to return an escaped (regex safe) string using re.escape. For example:
re.escape("Are U.S. Pythons worth any $$$?") will return 'Are\\ U\\.S\\.\\ Pythons\\ worth\\ any\\ \\$\\$\\$\\?'
From my (little) experience with Julia so far, I can tell there's probably a much more straightforward way of doing this in Julia, but I couldn't find any previous answers on the topic.
Julia uses the PCRE2 library underneath, and uses its regex-quoting syntax to automatically escape special characters when you join a Regex with a normal String. For example:
julia> r"\w+\s*" * raw"Are U.S. Pythons worth any $$$?"
r"(?:\w+\s*)\QAre U.S. Pythons worth any $$$?\E"
Here we've used a raw string to make sure that none of the characters are interpreted as special, including the $s.
If we needed interpolation, we can also use a normal String literal instead. In this case, the interpolation will be done, and then the quoting with \Q ... \E.
julia> snake = "Python"
"Python"
julia> r"\w+\s*" * "Are U.S. $snake worth any money?"
r"(?:\w+\s*)\QAre U.S. Python worth any money?\E"
So you can place the part of the regex you wish to be quoted in a normal String, and it will be quoted automatically when you join it up with a Regex.
You can even do it directly within the regex yourself - \Q starts a region where none of the regex-special characters are interpreted as special, and \E ends that region. Everything within such a region is treated literally by the regex engine.
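As an aside (this is not part of the Julia answer): the same \Q ... \E quoting works in Java's regex engine, which the question at the top of the page asked about, and Pattern.quote produces it for you. A rough sketch:

import java.util.regex.Pattern;

public class QuoteDemo {
    public static void main(String[] args) {
        String literal = "Are U.S. Pythons worth any $$$?";

        // Pattern.quote wraps the text in \Q ... \E so every character is treated literally.
        Pattern p = Pattern.compile("\\w+\\s*" + Pattern.quote(literal));
        System.out.println(p.pattern());
        // \w+\s*\QAre U.S. Pythons worth any $$$?\E

        System.out.println(p.matcher("Hey Are U.S. Pythons worth any $$$?").find());   // true
    }
}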

Match Unicode character with regular expression

I can use regular expressions in VBA for Word 2019:
Dim RegEx As New RegExp
Dim Matches As MatchCollection
RegEx.Pattern = "[\d\w]+"
Text = "HelloWorld"
Set Matches = RegEx.Execute(Text)
But how can I match all Unicode characters and all digits too?
\p{L} works fine for me in PHP, but this doesn't work for me in VBA for Word 2019.
I would like to find words with characters and digits. So in PHP I use for this [\p{L}\p{N}]+. Which pattern can I use for this in VBA?
Currently, I would like to match words with German characters, like äöüßÄÖÜ. But maybe I need this for other languages too.
But how can I match all Unicode characters and all digits too?
"VBScript Regular Expressions 5.5" (which I am pretty sure you are using here) are not "VBA Regular Expressions", they are a COM library that you can use in - among other things - VBA. They do not support Unicode with the built-in metacharacters (such as \w) and they have no knowledge of Unicode character classes (such as \p{L}). But of course you can still match Unicode characters with them.
Direct Matches
The simplest way is of course to directly use the Unicode characters you search for in the pattern. VBA uses Unicode strings, so matching Unicode is not a problem per se. Representing Unicode in your VBA source code, which itself is not Unicode, is a different matter. But ChrW() can help with that.
Assuming you have a certain character you want to match,
RegEx.Pattern = ChrW(&h4E16) & ChrW(&h754C)
Set Matches = RegEx.Execute(Text)
Msgbox Matches(0)
The above uses hex numbers (&h...) and ChrW() to create the Unicode characters U+4E16 and U+754C (世界) at run-time. When they are in your text, they will be found. This is tedious, but it works well if you already know what words you're looking for.
Ranges
If you want to match character ranges, you can do that as well. Use the start point and end point of the range. For example, the basic block of the "CJK Unified Ideographs" range goes from U+4E00 to U+9FFF:
RegEx.Pattern = "[" + ChrW(&h4E00) & "-" & ChrW(&h9FFF) & "]+"
Set Matches = RegEx.Execute(Text)
Msgbox Matches(0)
So this creates a natural range, just like [a-z]+, to span all of the CJK characters. You'd have to define which ranges you want to match, so it's less convenient than having built-in support, but nothing is stopping you.
Caveats
The above is about matching characters inside the BMP (Basic Multilingual Plane). Characters outside of the BMP, such as emoji, are a lot more difficult to handle because of the way these characters work in Unicode. It's still possible, but it's not going to be pretty.
There are multiple ways of representing the same character. For example, ä could be represented by its own, singular code point, or by a followed by a second code point for the dots (U+0308 "◌̈"). Since there is no telling how your input string represents certain characters, you should look into Unicode Normalization to make strings uniform before you search in them. In VBA this can be done by using the Win32 API.
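Not VBA, but for comparison (and because the question at the top of the page asked about Java), this is roughly what that normalization step looks like with java.text.Normalizer; the sample strings are just illustrations:

import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String composed   = "\u00E4";     // ä as a single code point
        String decomposed = "a\u0308";    // "a" followed by a combining diaeresis

        System.out.println(composed.equals(decomposed));   // false

        // Normalizing both strings to NFC makes them comparable before matching.
        String a = Normalizer.normalize(composed,   Normalizer.Form.NFC);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(a.equals(b));                    // true
    }
}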
Helpers
You can research Unicode ranges manually, but since there are so many of them, it's easy to miss some. I remember a useful helper for manually picking Unicode ranges, which still lives on the Internet Archive: http://web.archive.org/web/20191118224127/http://kourge.net/projects/regexp-unicode-block
It allows you to quickly build regexes that span multiple ranges. It's aimed at JavaScript, but it's easy enough to adapt the output for VBA code.

word start with uppercase (unicode) in laravel validation [duplicate]

I'm trying to write a reasonably permissive validator for names in PHP, and my first attempt consists of the following pattern:
// unicode letters, apostrophe, hyphen, space
$namePattern = "/^([\\p{L}'\\- ])+$/";
This is eventually passed to a call to preg_match(). As far as I can tell, this works with your vanilla ASCII alphabet, but seems to trip up on spicier characters like Ă or 张.
Is there something wrong with the pattern itself? Perhaps I'm expecting \p{L} to do more work than I think it does?
Or does it have something to do with the way input is being passed in? I'm not sure if it's relevant, but I did make sure to specify a UTF8 encoding on the form page.
I think the problem is much simpler than that: You forgot to specify the u modifier. The Unicode character properties are only available in UTF-8 mode.
Your regex should be:
// unicode letters, apostrophe, hyphen, space
$namePattern = '/^[-\' \p{L}]+$/u';
If you want to replace a Unicode old pattern with a new pattern, you should write:
$text = preg_replace('/\bold pattern\b/u', 'new pattern', $text);
So the key here is the u modifier.
Note: your server's PHP version should be at least PHP 4.3.5,
as mentioned on php.net (Pattern Modifiers):
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
Thanks to AgreeOrNot, who gave me that key in preg_replace match whole word in arabic.
I tried it and it worked on localhost, but when I tried it on a remote server it didn't work. Then I found on php.net that the u modifier is only fully supported since PHP 4.3.5, so I upgraded the PHP version and it worked.
It's important to know that this method is very helpful for Arabic users because, as I believe, Unicode is the best encoding for the Arabic language, and the replacement will not work if you don't use the u modifier. See the next example; it should work for you:
$text = preg_replace('/\bمرحبا بك\b/u', 'NEW', $text);
First of all, your life would be a lot easier if you used single quotes instead of double quotes when writing these -- you need only one backslash. Second, combining marks (\pM) should also be included. If you find a character that is not matched, please find out its Unicode code point; then you can use http://www.fileformat.info/info/unicode/ to figure out where it is. I found http://hsivonen.iki.fi/php-utf8/ an invaluable tool when debugging UTF-8 properties (don't forget to convert to hex before trying to look things up: array_map('dechex', utf8ToUnicode($text))).
For example, Ă turns out to be http://www.fileformat.info/info/unicode/char/0102/index.htm and to be in Lu and so L should match it and it does match for me. The other character is http://www.fileformat.info/info/unicode/char/5f20/index.htm and is also isLetter and indeed matches for me. Do you have the Unicode character tables compiled in?
Anyone else looking here and not getting this to work, please note that /u will not produce consistent results with Unicode scripts across different PHP versions.
See example: https://3v4l.org/4hB9e
Related: Inconsistent regex results for Thai characters across different PHP versions
<?php preg_match('/[a-zığüşöç]/u',$title) ?>

RegEx: A way to handle both English and non-English characters (and my solution)

I would like to know if there is a recommendable RegEx pattern to match both English and non-English characters. So far I have come up with [^\x00-\x7F]+|[a-zA-Z'-]* based on an answer provided on SO. My solution seems to work, but since I am very new to RegEx I would like to ask you to check this pattern and suggest some improvements. I am aware of most solutions that touch on this subject, like this one, but I don't think there is already a good RegEx for this.
The answer depends mostly on the language. But in general, you have to enable the "unicode flag" (this is usually done by prepending (?u) to your regex, or by appending /u) and use Unicode strings. This way, \w, \s and the others will correctly match the corresponding Unicode characters.
An example in Python 2 (Python 3 uses Unicode strings by default):
>>> re.match('\w', 'è') # byte string, no unicode flag: no match
>>> re.match('(?u)\w', u'è') # unicode string and unicode flag: match
<_sre.SRE_Match object at 0x7f258bac07e8>
>>> re.match('\w', u'è', re.UNICODE) # another way to enable the unicode flag
<_sre.SRE_Match object at 0x7f258bac0850>

why regex match letter s in CJK Unified Ideographs Extension B unicode 20000-2A6DF?

based on this example What's the complete range for Chinese characters in Unicode?
does the letter "s" belong to this alphabet?
var r = /[\u20000-\u2A6DF]/;
var t = 'sad';
console.log(t.match(r))
outputs ["s"]
Why?
The regex you have contains astral code points:
Astral code points are pretty easy to recognize: if you need more than 4 hexadecimal digits to represent the code point, it’s an astral code point.
These code points lie outside the Basic Multilingual Plane (BMP); only BMP code points can be referenced in a JavaScript regex with the \uXXXX syntax.
However, the JavaScript regex engine does not support astral code points (at least not in the ECMAScript implementation current at the time; support is present in ECMAScript 6, see Unicode code point escapes).
Thus, the problem arises when the JavaScript regex engine tries to interpret the regex pattern: it "sees" \u2000, then 0, then -, then \u2A6D, then F inside your character class. The engine then creates a range between 0 and \u2A6D (⩭), which is actually a very large number of characters; all English letters, and a lot more, can be matched with this regex.
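As an aside (not part of the JavaScript answer): Java 7's engine, which the question at the top of the page asked about, can address astral code points directly with \x{...}, so the intended range can be written there without surrogate pairs. A rough sketch with a made-up test string:

import java.util.regex.Pattern;

public class AstralDemo {
    public static void main(String[] args) {
        // U+20000 is the first character of CJK Unified Ideographs Extension B.
        String text = "s" + new String(Character.toChars(0x20000));

        Pattern extB = Pattern.compile("[\\x{20000}-\\x{2A6DF}]");
        System.out.println(extB.matcher(text).find());   // true  (the Extension B character)
        System.out.println(extB.matcher("sad").find());  // false ("s" is not in the range)
    }
}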
In the Javascript unicode string, chinese character but no punctuation post, you can find a comprehensive Chinese character regex for JavaScript that consists of possible Unicode code point combinations used in Chinese, but there are a couple of typos in it.
Here is a working snippet:
var r = /(?:[\u4E00-\u9FCC\u3400-\u4DB5\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34\udf40-\udfff]|\ud86e[\udc00-\udc1d])+/g;
var t = '我的中文不好。我是意大利人。你知道吗?';
console.log(t.match(r));