Regex: Replace "something" by a unicode character - regex

I am trying to figure out how to find a certain character and replace it with a Unicode character. In my example, I want to find all spaces (\s) and replace them with a narrow or thin space (e.g. Unicode U+2006).
Sample Text
8. 3. 2014
Search Pattern
(\d{1,2}\.)(\s?)(\d{1,2}\.)(\s?)(\d{2,4})
Replacement Pattern
$1{UNICODE}$3{UNICODE}$5
For some reason I cannot replace by(!) a Unicode character, I can only search for one.
I am working with a RegEx App called »RegExRX 3« to test my strings. In the end, I want to be able to use it with Adobe’s InDesign GREP functionality.
I know I could just copy and paste the correct whitespace into place but I am interested in how to do it with a Unicode character.
Thanks in advance!

InDesign uses Perl-compatible regular expressions (PCRE). To put a Unicode character into the replacement string, write \x{XXXX}, where XXXX is the hexadecimal character code (2009 is the thin space):
$1\x{2009}$3\x{2009}$5
But in general you can replace by any character you can type. Just put actual thin spaces into your search-and-replace dialog:
$1 $3 $5
You can use your OS's utilities to grab the thin space from the list of available characters. On Windows it's the "Character Map" tool, where the thin space can be found in the "General Punctuation" Unicode sub-range; searching for "thin space" works as well. macOS has the "Character Viewer", which can do the same thing.
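For comparison outside InDesign, the same replacement can be sketched in Python. This is only an illustration under assumptions: the function name thin_space_date is made up for the example, and Python's re module takes the character as \u2009 in an ordinary string instead of the \x{2009} syntax.

```python
import re

# Day and month (each followed by a dot) and a year, separated by optional whitespace.
DATE = re.compile(r"(\d{1,2}\.)\s?(\d{1,2}\.)\s?(\d{2,4})")

THIN_SPACE = "\u2009"  # U+2009 THIN SPACE


def thin_space_date(text: str) -> str:
    """Join the three captured parts of each date with a thin space."""
    return DATE.sub(lambda m: THIN_SPACE.join(m.groups()), text)


# ascii() makes the otherwise invisible thin spaces visible as escapes.
print(ascii(thin_space_date("8. 3. 2014")))
```

Only the digits and dots are captured here, so the replacement never has to refer to the whitespace groups at all.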

Related

Match Unicode character with regular expression

I can use regular expressions in VBA for Word 2019:
Dim RegEx As New RegExp
Dim Matches As MatchCollection
RegEx.Pattern = "[\d\w]+"
Text = "HelloWorld"
Set Matches = RegEx.Execute(Text)
But how can I match all Unicode characters and all digits too?
\p{L} works fine for me in PHP, but this doesn't work for me in VBA for Word 2019.
I would like to find words with characters and digits. So in PHP I use for this [\p{L}\p{N}]+. Which pattern can I use for this in VBA?
Currently, I would like to match words with German character, like äöüßÄÖÜ. But maybe I need this for other languages too.
"VBScript Regular Expressions 5.5" (which I am pretty sure you are using here) are not "VBA Regular Expressions", they are a COM library that you can use in - among other things - VBA. They do not support Unicode with the built-in metacharacters (such as \w) and they have no knowledge of Unicode character classes (such as \p{L}). But of course you can still match Unicode characters with them.
Direct Matches
The simplest way is of course to directly use the Unicode characters you search for in the pattern. VBA uses Unicode strings, so matching Unicode is not a problem per se. Representing Unicode in your VBA source code, which itself is not Unicode, is a different matter. But ChrW() can help with that.
Assuming you have a certain character you want to match,
RegEx.Pattern = ChrW(&h4E16) & ChrW(&h754C)
Set Matches = RegEx.Execute(Text)
Msgbox Matches(0)
The above uses hex numbers (&h...) and ChrW() to create the Unicode characters U+4E16 and U+754C (世界) at run-time. When they are in your text, they will be found. This is tedious, but it works well if you already know what words you're looking for.
Ranges
If you want to match character ranges, you can do that as well. Use the start point and end point of the range. For example, the basic block of the "CJK Unified Ideographs" range goes from U+4E00 to U+9FFF:
RegEx.Pattern = "[" & ChrW(&h4E00) & "-" & ChrW(&h9FFF) & "]+"
Set Matches = RegEx.Execute(Text)
Msgbox Matches(0)
So this creates a natural range just like [a-z]+ to span all of the CJK characters. You'd have to define which ranges you want to match, so it's less convenient than having built-in support, but nothing is stopping you.
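The range technique is not specific to VBA. As a rough sketch in Python (the helper name range_class is hypothetical, and Python is used only for illustration), a character class can be assembled from code-point pairs the same way ChrW() is used above:

```python
import re


def range_class(*ranges):
    """Build a character class from (first, last) code-point pairs."""
    parts = [f"{re.escape(chr(lo))}-{re.escape(chr(hi))}" for lo, hi in ranges]
    return "[" + "".join(parts) + "]+"


# Basic block of CJK Unified Ideographs, U+4E00 to U+9FFF, as in the text above.
cjk = re.compile(range_class((0x4E00, 0x9FFF)))

print(cjk.findall("Hello 世界, hello 日本"))
```

Passing several pairs to range_class produces one class that spans all of them, which mirrors concatenating several ChrW() ranges inside the brackets.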
Caveats
The above is about matching characters inside the BMP (Basic Multilingual Plane). Matching characters outside of the BMP, such as emoji, is a lot more difficult because of the way these characters work in Unicode. It's still possible, but it's not going to be pretty.
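To illustrate why, here is a small Python sketch (not VBA; the helper name utf16_units is made up). VBA strings are stored as UTF-16, and in UTF-16 a character outside the BMP takes two 16-bit code units, a surrogate pair, so a simple one-character range no longer describes it:

```python
def utf16_units(text: str) -> list:
    """Return the 16-bit UTF-16 code units that encode the given text."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# One BMP character and one emoji (U+1F600) for comparison.
for ch in ("世", "\U0001F600"):
    print(f"U+{ord(ch):04X}", [f"{unit:04X}" for unit in utf16_units(ch)])
```

The BMP character yields a single unit, while the emoji yields a high and a low surrogate that a pattern would have to match as a sequence.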
There are multiple ways of representing the same character. For example, ä could be represented by its own, singular code-point, or by a followed by a second code-point for the dots (U+0308 "◌̈"). Since there is no telling how your input string represents certain characters, you should look into Unicode Normalization to make strings uniform before you search in them. In VBA this can be done by using the Win32 API.
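The effect of normalization is easy to see in a short Python sketch (Python's standard unicodedata module stands in for the Win32 API here; the helper name same_text is hypothetical):

```python
import unicodedata

composed = "\u00e4"     # ä as a single code point
decomposed = "a\u0308"  # a followed by U+0308 COMBINING DIAERESIS


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)           # raw comparison of code points
print(same_text(composed, decomposed))  # comparison after normalization
```

Normalizing both the text and any literal characters in the pattern to the same form before searching avoids missing matches that only differ in representation.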
Helpers
You can research Unicode ranges manually, but since there are so many of them, it's easy to miss some. I remember a useful helper for manually picking Unicode ranges, which now still lives on the Internet Archive: http://web.archive.org/web/20191118224127/http://kourge.net/projects/regexp-unicode-block
It allows you to quickly build regexes that span multiple ranges. It's aimed at JavaScript, but it's easy enough to adapt the output for VBA code.

Why Unicode Character 'MINUS SIGN' (U+2212) is NOT in regex unicode group \p{Pd} (Dash_Punctuation)?

I'm trying to collect all dash-signs to use it while analyzing raw text data. I've found that Unicode regexp \p{Pd} should match all cases, but after all, it turned out that this character − doesn't match!
Here is more info about this char:
https://www.fileformat.info/info/unicode/char/2212/index.htm
Is it a bug or a feature? Practically it's not useful stuff.
The Unicode character U+2212 MINUS SIGN is a math-related symbol, and is probably not considered as a punctuation mark; for instance, it is matched by \p{Math} but not by \p{Punctuation} (which includes \p{Dash_Punctuation}).
You may want to try using \p{Dash} instead, and check whether it covers all your needs or not...
Ref: Properties for U+2212
Edit:
Here is an "official" list of all the characters having a Dash Unicode property: https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Dash=Yes:], including the U+2212 MINUS SIGN character.
In Unicode 12.0, the JavaScript regular expression:
/\p{Dash}/u
would be equivalent to:
/[\u002D\u058A\u05BE\u1400\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]/
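For engines without \p{Dash}, the explicit list can be used directly. The following Python sketch is an illustration only: Python's built-in re module has no \p{...} support, so the class is built from the code points listed above, and unicodedata.category shows why U+2212 is missed by \p{Pd}.

```python
import re
import unicodedata

# Code points copied from the Unicode 12.0 expansion of \p{Dash} shown above.
DASH_CODE_POINTS = [
    0x002D, 0x058A, 0x05BE, 0x1400, 0x1806, 0x2010, 0x2011, 0x2012,
    0x2013, 0x2014, 0x2015, 0x2053, 0x207B, 0x208B, 0x2212, 0x2E17,
    0x2E1A, 0x2E3A, 0x2E3B, 0x2E40, 0x301C, 0x3030, 0x30A0, 0xFE31,
    0xFE32, 0xFE58, 0xFE63, 0xFF0D,
]

# re.escape keeps the plain hyphen from being read as a range operator.
DASH = re.compile("[" + "".join(re.escape(chr(cp)) for cp in DASH_CODE_POINTS) + "]")

# Print each character's general category and whether the Dash class matches it.
for ch in ("-", "\u2013", "\u2212"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), bool(DASH.search(ch)))
```

The general category of U+2212 is Sm (math symbol), not Pd, yet it is still covered by the Dash list.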

word start with uppercase (unicode) in laravel validation [duplicate]

I'm trying to write a reasonably permissive validator for names in PHP, and my first attempt consists of the following pattern:
// unicode letters, apostrophe, hyphen, space
$namePattern = "/^([\\p{L}'\\- ])+$/";
This is eventually passed to a call to preg_match(). As far as I can tell, this works with your vanilla ASCII alphabet, but seems to trip up on spicier characters like Ă or 张.
Is there something wrong with the pattern itself? Perhaps I'm expecting \p{L} to do more work than I think it does?
Or does it have something to do with the way input is being passed in? I'm not sure if it's relevant, but I did make sure to specify a UTF8 encoding on the form page.
I think the problem is much simpler than that: You forgot to specify the u modifier. The Unicode character properties are only available in UTF-8 mode.
Your regex should be:
// unicode letters, apostrophe, hyphen, space
$namePattern = '/^[-\' \p{L}]+$/u';
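The byte-versus-text distinction behind the u modifier exists in other languages too. As a hedged comparison in Python (not PHP): Python's built-in re has no \p{L}, but with str patterns \w is Unicode-aware, so [^\W\d_] roughly approximates "letter". The names NAME, BYTES_NAME and is_valid_name are made up for this sketch.

```python
import re
import unicodedata

# Text pattern: letters (approximated by [^\W\d_]), apostrophe, hyphen, space.
NAME = re.compile(r"(?:[^\W\d_]|['\- ])+")

# Bytes pattern: \w only covers ASCII here, similar to PCRE without /u.
BYTES_NAME = re.compile(rb"[\w'\- ]+")


def is_valid_name(name: str) -> bool:
    """Check a name after composing base letters with their combining marks."""
    name = unicodedata.normalize("NFC", name)
    return NAME.fullmatch(name) is not None


print(is_valid_name("Ăna-Maria O'Neil"), is_valid_name("张伟"), is_valid_name("R2D2"))
print(BYTES_NAME.fullmatch("Ă".encode("utf-8")))
```

The bytes pattern fails on Ă for the same reason the original PHP pattern did: the engine sees two unrelated bytes instead of one letter.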
If you want to replace one Unicode pattern with another, you should write:
$text = preg_replace('/\bold pattern\b/u', 'new pattern', $text);
So the key here is the u modifier.
Note: your server's PHP version should be at least PHP 4.3.5,
as mentioned on php.net under Pattern Modifiers:
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
Thanks to AgreeOrNot, who gave me this hint in the question "preg_replace match whole word in arabic".
I tried it and it worked on localhost, but on the remote server it didn't. I then found on php.net that PHP 4.3.5 is the relevant version for the u modifier, upgraded PHP, and it worked.
It's important to know that this method is very helpful for Arabic users (عربي) because, as I believe, Unicode is the best encoding for the Arabic language, and the replacement will not work if you don't use the u modifier. The next example should work for you:
$text = preg_replace('/\bمرحبا بك\b/u', 'NEW', $text);
First of all, your life would be a lot easier if you'd use single apostrophes instead of double quotes when writing these -- you need only one backslash. Second, combining marks \pM should also be included. If you find a character not matched please find out its Unicode code point and then you can use http://www.fileformat.info/info/unicode/ to figure out where it is. I found http://hsivonen.iki.fi/php-utf8/ an invaluable tool when doing debugging with UTF-8 properties (don't forget to convert to hex before trying to look up: array_map('dechex', utf8ToUnicode($text))).
For example, Ă turns out to be http://www.fileformat.info/info/unicode/char/0102/index.htm and to be in Lu and so L should match it and it does match for me. The other character is http://www.fileformat.info/info/unicode/char/5f20/index.htm and is also isLetter and indeed matches for me. Do you have the Unicode character tables compiled in?
For anyone else looking here and not getting this to work: please note that /u will not produce consistent results with Unicode scripts across different PHP versions.
See example: https://3v4l.org/4hB9e
Related: Incosistent regex result for Thai characters across different PHP version
<?php preg_match('/[a-zığüşöç]/u',$title) ?>

VBA ActiveDocument.Range.Text in Unicode

In VBA (Word specifically), I'm trying to use the RegExp object to search through a long document. Some of the patterns I search for include Unicode characters (such as a non-breaking hyphen or non-breaking space). When I access the text via
ActiveDocument.Range.Text
I get the text, but stripped of Unicode characters (or at least some of them, ones that I need). For example, if the text is ABC-123, where the hyphen is a non-breaking (or hard) hyphen, U+2011, then ActiveDocument.Range.Text displays it as ABC123.
I thought perhaps it just displays it incorrectly, and that the character is really there, but all the search and replace I've done don't show it. Plus, when I regex the unicode character using \u2011, it doesn't find it.
Is there another way to access the document's full content, but intact with all the unicode characters?
UPDATE: I inspected the ABC123 output, and it appears that the character is there but hidden. That is, Len(str) = 7 rather than the 6 you would expect from what is displayed. The following shows what is happening:
Print Asc(Mid(str, 4, 1))
=> 30
ASCII character 30, or \u001e is a record separator. When I search for this, it finds this zero-length character. I tested a wider range of unicode characters (\u2000-\u201f) and interestingly they all are detected with the \u control sequence in the regex, except for \u2011, which changes to \u001e. Even the en-space (\u2002) and em-space (\u2003) are recognized. I haven't done it for all the unicode characters, but it seems odd that I have stumbled upon one of the few that don't register.
This isn't an answer, but a workaround. When using RegExp to search for Unicode characters, most will be recognized in ActiveDocument.Range.Text using a \uxxxx code. If one is not, open a new Word document and add some text to the body that contains the Unicode character (e.g. a non-breaking hyphen). Then, in VBA, use the Immediate window to find the ASCII character code for that character:
Print Asc(Mid(ActiveDocument.Range.Text, <char_position>, 1))
This will tell you whether the character is actually there (in case it doesn't show up in strings). The code you get won't work for every Unicode character, since some of them are converted to ASCII characters: for example, en quad (\u2000) returns ASCII 32, a space, when you call Asc() on it. Luckily, you can still regex \u2000 and it will find it.
For the non-breaking hyphen, the code that works with regex is \u001e.
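The workaround amounts to translating Word's stand-in character back before (or while) matching. A minimal Python sketch of that idea, assuming the text has already been exported from Word and contains character 30 where the document has U+2011 (the sample string and the helper name restore_hyphens are hypothetical):

```python
import re

WORD_NB_HYPHEN = "\x1e"  # character 30, reported above in place of U+2011
NB_HYPHEN = "\u2011"     # U+2011 NON-BREAKING HYPHEN


def restore_hyphens(text: str) -> str:
    """Map Word's internal code 30 back to the real non-breaking hyphen."""
    return text.replace(WORD_NB_HYPHEN, NB_HYPHEN)


raw = "ABC\x1e123"  # hypothetical value of ActiveDocument.Range.Text
pattern = re.compile(r"[A-Z]+\u2011\d+")

print(len(raw), pattern.search(raw))
print(ascii(pattern.search(restore_hyphens(raw)).group()))
```

Either approach works: convert the text first, as here, or keep the text as it is and search for \u001e as described above.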

regex unicode character in vim

I'm being an idiot.
Someone cut and pasted some text from microsoft word into my lovely html files.
I now have these unicode characters instead of regular quote symbols, (i.e. quotes appear as <92> in the text)
I want to do a regex replace but I'm having trouble selecting them.
:%s/\u92/'/g
:%s/\u5C/'/g
:%s/\x92/'/g
:%s/\x5C/'/g
...all fail. My google-fu has failed me.
From :help regexp (lightly edited), you need to use some specific syntax to select unicode characters with a regular expression in Vim:
\%u match specified multibyte character (eg \%u20ac)
That is, to search for the unicode character with hex code 20AC, enter this into your search pattern:
\%u20ac
The full table of character search patterns includes some additional options:
\%d match specified decimal character (eg \%d123)
\%x match specified hex character (eg \%x2a)
\%o match specified octal character (eg \%o040)
\%u match specified multibyte character (eg \%u20ac)
\%U match specified large multibyte character (eg \%U12345678)
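Picking the right atom from this table depends only on how many hex digits the code point needs. As an illustration, a small Python helper (hypothetical name vim_char_atom, not part of Vim) can generate the atom to paste into a Vim search:

```python
def vim_char_atom(ch: str) -> str:
    """Return a Vim regex atom (\\%x, \\%u or \\%U) matching one character."""
    cp = ord(ch)
    if cp <= 0xFF:
        return f"\\%x{cp:02x}"   # up to two hex digits
    if cp <= 0xFFFF:
        return f"\\%u{cp:04x}"   # up to four hex digits
    return f"\\%U{cp:08x}"       # up to eight hex digits


# A two-digit, a four-digit and an eight-digit case.
for ch in ("*", "€", "\U0001F600"):
    print(ch, vim_char_atom(ch))
```

For the euro sign this yields the same \%u20ac atom as the example from the help text.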
This solution might not address the problem as originally stated, but it does address a different but very closely related one and I think it makes a lot of sense to place it here.
I don't know in which version of Vim it was implemented, but I was working on 7.4 when I tried it.
When in Insert mode, the sequence to output Unicode characters is: ctrl-v u xxxx, where xxxx is the code point. For instance, outputting the euro sign would be ctrl-v u 20ac.
I tried it in Command-line mode as well and it worked. That is, to replace all instances of "20 euro" in my document with "20 €", I'd do:
:%s/20 euro/20 <ctrl-v u 20ac>/gc
In the above <ctrl-v u 20ac> is not literal, it's the sequence of keys that will output the € character.