Match Unicode character with regular expression - regex

I can use regular expressions in VBA for Word 2019:
' Requires a reference to "Microsoft VBScript Regular Expressions 5.5"
Dim RegEx As New RegExp
Dim Matches As MatchCollection
Dim Text As String
RegEx.Pattern = "[\d\w]+"
Text = "HelloWorld"
Set Matches = RegEx.Execute(Text)
But how can I match all Unicode characters and all digits too?
\p{L} works fine for me in PHP, but this doesn't work for me in VBA for Word 2019.
I would like to find words with characters and digits. So in PHP I use for this [\p{L}\p{N}]+. Which pattern can I use for this in VBA?
Currently, I would like to match words with German characters such as äöüßÄÖÜ, but I may need this for other languages too.

But how can I match all Unicode characters and all digits too?
"VBScript Regular Expressions 5.5" (which I am pretty sure you are using here) are not "VBA Regular Expressions", they are a COM library that you can use in - among other things - VBA. They do not support Unicode with the built-in metacharacters (such as \w) and they have no knowledge of Unicode character classes (such as \p{L}). But of course you can still match Unicode characters with them.
Direct Matches
The simplest way is of course to directly use the Unicode characters you search for in the pattern. VBA uses Unicode strings, so matching Unicode is not a problem per se. Representing Unicode in your VBA source code, which itself is not Unicode, is a different matter. But ChrW() can help with that.
Assuming you have a certain character you want to match,
RegEx.Pattern = ChrW(&h4E16) & ChrW(&h754C)
Set Matches = RegEx.Execute(Text)
MsgBox Matches(0)
The above uses hex numbers (&h...) and ChrW() to create the Unicode characters U+4E16 and U+754C (世界) at run-time. When they are in your text, they will be found. This is tedious, but it works well if you already know what words you're looking for.
Ranges
If you want to match character ranges, you can do that as well. Use the start point and end point of the range. For example, the basic block of the "CJK Unified Ideographs" range goes from U+4E00 to U+9FFF:
RegEx.Pattern = "[" + ChrW(&h4E00) & "-" & ChrW(&h9FFF) & "]+"
Set Matches = RegEx.Execute(Text)
Msgbox Matches(0)
This creates a character range, just like [a-z]+, that spans all of the CJK characters in that block. You have to define yourself which ranges you want to match, so it's less convenient than having built-in support, but nothing is stopping you.
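For the German letters from the question, the same range idea works. Here is a minimal sketch, written in Python syntax for brevity (in VBA you would build the class with ChrW exactly as above); the choice of U+00C0-U+00FF is my assumption of a range that happens to cover äöüßÄÖÜ:

import re

# [0-9A-Za-z\u00C0-\u00FF] covers the ASCII letters and digits plus the
# Latin-1 Supplement letters (including äöüß and ÄÖÜ). Note that U+00D7
# (multiplication sign) and U+00F7 (division sign) fall inside this range
# too and may need to be excluded.
pattern = r"[0-9A-Za-z\u00C0-\u00FF]+"
print(re.findall(pattern, "Größe 42 und Äpfel"))  # ['Größe', '42', 'und', 'Äpfel']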
Caveats
The above is about matching characters inside the BMP (Basic Multilingual Plane). Matching characters outside the BMP, such as emoji, is a lot more difficult because of the way these characters are encoded (as surrogate pairs). It's still possible, but it's not going to be pretty.
There are multiple ways of representing the same character. For example, ä could be represented by its own single code point, or by a followed by a second, combining code point for the dots (U+0308 "◌̈"). Since there is no telling how your input string represents certain characters, you should look into Unicode normalization to make strings uniform before you search in them. In VBA this can be done by using the Win32 API.
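The Win32 route is beyond the scope of this answer, but as a quick illustration of the concept, here is what NFC normalization does, sketched in Python:

import unicodedata

# NFC composes "a" + U+0308 (combining diaeresis) into the single
# code point U+00E4 before you search.
decomposed = "a\u0308"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00E4")            # True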
Helpers
You can research Unicode ranges manually, but since there are so many of them, it's easy to miss some. I remember a useful helper for manually picking Unicode ranges; it now lives on in the Internet Archive: http://web.archive.org/web/20191118224127/http://kourge.net/projects/regexp-unicode-block
It allows you to quickly build regexes that span multiple ranges. It's aimed at JavaScript, but it's easy enough to adapt the output for VBA.

Related

Go regexp for printable characters

I have a Go server application that processes and saves a user-specified name. I really don't care what the name is; if they want it to be in hieroglyphs or emojis that's fine, as long as most clients can display it. Based on this question for C# I was hoping to use
^[^\p{Cc}\p{Cn}\p{Cs}]{1,50}$
basically 1-50 characters that are not control characters, unassigned characters, or partial UTF-16 characters. But Go does not support Cn. Basically I can't find a reasonable regexp that will match any printable unicode string but not "퟿͸", for example.
I want to use regex because the clients are not written in Go and I want to be able to precisely match the server validation. It's not clear how to match functions like isPrint in other languages.
Is there any way to do this other than hard-coding the unassigned unicode ranges into my application and separately checking for those?
You probably want to use just these Unicode character classes:
L (Letter)
M (Mark)
N (Number)
P (Punctuation)
S (Symbol)
That would give you this [positive] regular expression:
^[\pL\pM\pN\pP\pS]+$
Alternatively, test for those Unicode character classes which you don't want:
Z (Separator)
C (Other)
Again as a positive regular expression, this time with a negated character class:
^[^\pZ\pC]+$
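The clients are not written in Go, so as a rough cross-check in another engine, here is a minimal sketch using Python's third-party regex module (an assumption on my part; the standard library re module has no \p{...} support):

import regex  # third-party: pip install regex

# Same idea as ^[^\pZ\pC]+$ in Go: every character must be neither a
# separator (Z) nor an "other" (C: control, format, surrogate, private
# use, unassigned). fullmatch avoids the trailing-newline quirk of $.
printable = regex.compile(r"[^\p{Z}\p{C}]{1,50}")

print(bool(printable.fullmatch("Jörg")))        # True
print(bool(printable.fullmatch("名前")))         # True
print(bool(printable.fullmatch("with space")))  # False - U+0020 is \p{Z}
print(bool(printable.fullmatch("tab\there")))   # False - U+0009 is \p{C}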

Regex: Replace "something" by a unicode character

I am trying to figure out how to find a certain character and replace it with a Unicode character. In my example, I want to find all spaces (\s) and replace them with a narrow or thin space (e.g. Unicode U+2006).
Sample Text
8. 3. 2014
Search Pattern
(\d{1,2}\.)(\s?)(\d{1,2}\.)(\s?)(\d{2,4})
Replacement Pattern
$1{UNICODE}$3{UNICODE}$5
For some reason I cannot replace by(!) a Unicode character, I can only search for one.
I am working with a RegEx App called »RegExRX 3« to test my strings. In the end, I want to be able to use it with Adobe’s InDesign GREP functionality.
I know I could just copy and paste the correct whitespace into place but I am interested in how to do it with a Unicode character.
Thanks in advance!
InDesign uses Perl-compatible regular expressions (PCRE-style). Getting a Unicode character into the replacement string is done with \x{XXXX}, where XXXX is the hexadecimal character code:
$1\x{2009}$3\x{2009}$5
But in general you can replace by any character you can type. Just put actual thin spaces into your search-and-replace dialog:
$1 $3 $5
You can use your OS's utilities to grab the thin space from the list of available characters. On Windows that's the "Character Map" tool, where the thin space can be found in the "General Punctuation" Unicode sub-range; searching for "thin space" works as well. macOS has the "Character Viewer", which can do the same thing.
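Outside of InDesign, the replacement is easy to test in other engines. Here is a minimal sketch in Python (purely illustrative; Python's re.sub template has no \x{...} escape, so the thin space is embedded via a normal string literal):

import re

text = "8. 3. 2014"
pattern = r"(\d{1,2}\.)\s?(\d{1,2}\.)\s?(\d{2,4})"

# "\u2009" in a (non-raw) string literal is the thin space character itself;
# \1, \2, \3 refer to the captured groups.
result = re.sub(pattern, "\\1\u2009\\2\u2009\\3", text)
print(result)  # '8.\u20093.\u20092014' (thin spaces between the parts)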

VBA ActiveDocument.Range.Text in Unicode

In VBA (Word specifically), I'm trying to use the RegExp object to search through a long document. Some of the patterns I search for include unicode character (such as a non-breaking hyphen or non-breaking space). When I access the text via
ActiveDocument.Range.Text
I get the text, but stripped of Unicode characters (or at least some of them, ones that I need). For example, if the text is ABC-123, where the hyphen is a non-breaking (or hard) hyphen, U+2011, then accessing it via ActiveDocument.Range.Text displays it as ABC123.
I thought perhaps it just displays it incorrectly, and that the character is really there, but all the search and replace I've done don't show it. Plus, when I regex the unicode character using \u2011, it doesn't find it.
Is there another way to access the document's full content, but intact with all the unicode characters?
UPDATE: I inspected the ABC123 output, and it appears that the character is hidden rather than removed. That is, Len(str) = 7 instead of the 6 you'd expect. The following shows what is happening:
Print Asc(Mid(str, 4, 1))
=> 30
ASCII character 30, or \u001e, is the record separator. When I search for that, it finds this zero-length character. I tested a wider range of Unicode characters (\u2000-\u201f) and, interestingly, they are all detected with the \u escape in the regex, except for \u2011, which changes to \u001e. Even the en space (\u2002) and em space (\u2003) are recognized. I haven't tried all the Unicode characters, but it seems odd that I have stumbled upon one of the few that don't register.
This isn't an answer but a workaround. When using RegExp to search for Unicode characters, most of them will be recognized in the ActiveDocument.Range.Text string using a \uxxxx code. If one isn't, open a new Word document, add some text to the body that contains the Unicode character in question (e.g. a non-breaking hyphen), and then use the VBA Immediate window to find the character code Word actually stores for it:
Print Asc(Mid(ActiveDocument.Range.Text, <char_position>, 1))
This will tell you whether the character is actually there, even when it doesn't show up in the string. The code you get won't work in the regex for every Unicode character, since some of them are converted to ASCII characters (the en quad \u2000, for example, returns ASCII 32, space, from the Asc() function; luckily, a regex for \u2000 still finds it).
For the non-breaking hyphen, the code that works with regex is \u001e.

Regular expressions (regex) in Japanese

I am learning about regular expressions (regex) for English, and although some of the concepts seem like they would apply to other languages such as Japanese, I feel as if many others would not. For example, a common use of regex is to check whether a word contains non-alphanumeric characters. I don't see how this technique and others like it would work for Japanese: there are not only three writing systems, but kanji are also very complex and span a much greater range than alphanumeric characters do. I would appreciate any information on this topic, as well as pointers to areas worth looking into further, as I have very little knowledge of the subject even though I have taken many Japanese courses. If at all possible, I would like your answers to use Python and Java, as those are the languages I am comfortable with. Thank you for your help.
Python regexes offer limited support for Unicode features. Java is better, particularly Java 7.
Java supports Unicode categories. E.g., \p{L} (and its shorthand, \pL) matches any letter in any language. This includes Japanese ideographic characters.
Java 7 supports Unicode scripts, including the Hiragana, Katakana, Han, and Latin scripts that Japanese text is typically composed of. You can match any character in one of these scripts using \p{Han}, \p{Hiragana}, \p{Katakana}, and \p{Latin}. You can combine them in a character class such as [\p{Han}\p{Hiragana}\p{Katakana}]. You can use an uppercase P (as in, \P{Han}) to match any character except those in the Han script.
Java 7 supports Unicode blocks. Unless running your code in Android (where scripts are not available), you should generally avoid blocks, since they are less useful and accurate than Unicode scripts. There are a variety of blocks related to Japanese text, including \p{InHiragana}, \p{InKatakana}, \p{InCJK_Unified_Ideographs}, \p{InCJK_Symbols_and_Punctuation}, etc.
Both Java and Python can refer to individual code points using \uFFFF, where FFFF is any four-digit hexadecimal number. Java 7 can refer to any Unicode code point, including those beyond the Basic Multilingual Plane, using e.g. \x{10FFFF}. Python regexes don't support 21-bit Unicode escapes, but Python strings do, so you can embed a code point in a regex using e.g. \U0010FFFF (uppercase U followed by eight hex digits).
The Java 7 (?U) or UNICODE_CHARACTER_CLASS flag makes character class shorthands like \w and \d Unicode aware, so they will match Japanese ideographic characters, etc. (but note that \d will still not match kanji for numbers like 一二三四). Python 3 makes shorthand classes Unicode aware by default. In Python 2, shorthand classes are Unicode aware when you use the re.UNICODE or re.U flag.
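A quick sketch of the Python 3 defaults mentioned above (the sample strings are mine):

import re

# \w is Unicode-aware by default in Python 3, so it matches kanji and kana;
# \d only matches decimal digits (category Nd), not kanji numerals.
print(re.findall(r"\w+", "漢字kanji123"))  # ['漢字kanji123']
print(re.findall(r"\d+", "123 一二三"))     # ['123']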
You're right that not all regex ideas carry over equally well to all scripts. Some things (such as letter casing) just don't make sense with Japanese text.
For Python
#!/usr/bin/env python3
import re

kanji = '漢字'
hiragana = 'ひらがな'
katakana = 'カタカナ'
text = kanji + hiragana + katakana

# Match kanji (CJK Unified Ideographs)
regex = r'[\u4E00-\u9FFF]+'  # roughly '[一-龠々]+'
match = re.search(regex, text)
print(match.group())  # => 漢字

# Match hiragana
regex = r'[\u3040-\u309Fー]+'  # roughly '[ぁ-んー]+'
match = re.search(regex, text)
print(match.group())  # => ひらがな

# Match katakana
regex = r'[\u30A0-\u30FF]+'  # roughly '[ァ-ヾ]+'
match = re.search(regex, text)
print(match.group())  # => カタカナ
The Java character classes do something like what you are looking for. They are the ones that start with \p.
In Unicode there are two ways to classify characters from different writing systems. They are
Unicode Script (all characters used in a script, regardless of Unicode code points - may come from different blocks)
Unicode Block (code point ranges used for a specific purpose/script - may span across scripts and scripts may span across blocks)
The differences between these are explained rather more clearly on this web page from the official Unicode website.
In terms of matching characters in regular expressions in Java, you can use either classification mechanism since Java 7.
This is the syntax, as indicated in this tutorial from the Oracle website:
Script:
either \p{IsHiragana} or \p{script=Hiragana}
Block:
either \p{InHiragana} or \p{block=Hiragana}
Note that in one case it's "Is", in the other it's "In".
The syntax \p{Hiragana} indicated in the accepted answer does not seem to be a valid option. I tried it just in case but can confirm that it did not work for me.
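For completeness on the Python side: the standard library re module has no script properties at all, but the third-party regex module does. A hedged sketch based on that module (as far as I know it accepts the bare script name, unlike Java):

import regex  # third-party: pip install regex

text = "漢字ひらがなカタカナ"

# Script properties, roughly analogous to Java's \p{IsHiragana} /
# \p{script=Hiragana} discussed above.
print(regex.findall(r"\p{Han}+", text))       # ['漢字']
print(regex.findall(r"\p{Hiragana}+", text))  # ['ひらがな']
print(regex.findall(r"\p{Katakana}+", text))  # ['カタカナ']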

internationalized regular expression in postgresql

How can I write regular expressions to match names like 'José' in Postgres? In other words, I need to set up a constraint to check that only valid names are entered, but I want to allow Unicode characters as well.
"Regular Expressions, Unicode style" has some reference material on this, but it seems I can't write such a pattern in Postgres.
If it is not possible to write a regex for this, will it be sufficient to check only on the client side using JavaScript?
PostgreSQL doesn't support character classes based on the Unicode Character Database like .NET does. You get the more-standard [[:alpha:]] character class, but this is locale-dependent and probably won't cover it.
You may be able to get away with just blacklisting the ASCII characters you don't want and allowing all non-ASCII characters, e.g. something like
[^\s!"#$%&'()*+,\-./:;<=>?\[\\\]^_`~]+
(JavaScript doesn't have non-ASCII character classes either. Or even [[:alpha:]].)
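As a quick sanity check of the blacklist approach, here is an illustrative sketch in Python rather than PostgreSQL (the two regex flavors differ in details, so test the final pattern in Postgres itself; the sample names are mine):

import re

# The blacklist character class from above, required to cover the whole name.
name_ok = re.compile(r"""[^\s!"#$%&'()*+,\-./:;<=>?\[\\\]^_`~]+""")

print(bool(name_ok.fullmatch("José")))    # True  - non-ASCII letters pass
print(bool(name_ok.fullmatch("Jo<sé>")))  # False - blacklisted punctuation
print(bool(name_ok.fullmatch("")))        # False - empty names rejected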
For example, given v_text as a text variable to be sanitized:
-- Allow internationalized text characters and remove undesired characters
v_text = regexp_replace( lower(trim(v_text)), '[!"#$%&()*+,./:;<=>?\[\\\]\^_\|~]+', '', 'g' );