regex unicode character in vim

I'm being an idiot.
Someone cut and pasted some text from Microsoft Word into my lovely HTML files.
I now have these Unicode characters instead of regular quote symbols (i.e. quotes appear as <92> in the text).
I want to do a regex replace, but I'm having trouble selecting them.
:%s/\u92/'/g
:%s/\u5C/'/g
:%s/\x92/'/g
:%s/\x5C/'/g
...all fail. My google-fu has failed me.

From :help regexp (lightly edited), you need to use some specific syntax to select unicode characters with a regular expression in Vim:
\%u match specified multibyte character (eg \%u20ac)
That is, to search for the unicode character with hex code 20AC, enter this into your search pattern:
\%u20ac
The full table of character search patterns includes some additional options:
\%d match specified decimal character (eg \%d123)
\%x match specified hex character (eg \%x2a)
\%o match specified octal character (eg \%o040)
\%u match specified multibyte character (eg \%u20ac)
\%U match specified large multibyte character (eg \%U12345678)
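For the question's case, a plausible guess is that the <92> is byte 0x92 from a Windows-1252 (Word) paste, which is U+2019 RIGHT SINGLE QUOTATION MARK once decoded. Here is a minimal Python sketch of the same search-and-replace idea outside Vim, under that assumption (the sample bytes and the function name are made up for illustration):

```python
import re

# Assumption: the stray <92> is Windows-1252 byte 0x92, i.e. U+2019 once decoded.
raw = b"it\x92s a \x93quoted\x94 word"  # hypothetical sample pasted from Word
text = raw.decode("cp1252")


def straighten_quotes(s: str) -> str:
    """Replace curly single/double quotes with their plain ASCII equivalents."""
    s = re.sub("[\u2018\u2019]", "'", s)  # left/right single quotation marks
    s = re.sub("[\u201c\u201d]", '"', s)  # left/right double quotation marks
    return s


print(straighten_quotes(text))
```

In Vim itself, the patterns from the table above would presumably be \%u2019 if the buffer was decoded as Unicode, or \%x92 if the byte is still shown as <92>; which one applies depends on the file's encoding.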

This might not address the problem exactly as stated, but it addresses a closely related one, so I think it makes sense to post it here.
I don't know in which version of Vim it was implemented, but I was working on 7.4 when I tried it.
In Insert mode, the key sequence to enter a Unicode character is ctrl-v u xxxx, where xxxx is the hexadecimal code point. For instance, entering the euro sign would be ctrl-v u 20ac.
I tried it on the command line as well and it worked. That is, to replace all instances of "20 euro" in my document with "20 €", I'd do:
:%s/20 euro/20 <ctrl-v u 20ac>/gc
In the above <ctrl-v u 20ac> is not literal, it's the sequence of keys that will output the € character.

Related

Regex: Replace "something" by a unicode character

I am trying to figure out how to find a certain character and replace it with a Unicode character. In my example, I want to find all spaces (\s) and replace them with a narrow or thin space (e.g. Unicode U+2006).
Sample Text
8. 3. 2014
Search Pattern
(\d{1,2}\.)(\s?)(\d{1,2}\.)(\s?)(\d{2,4})
Replacement Pattern
$1{UNICODE}$3{UNICODE}$5
For some reason I cannot replace by(!) a Unicode character, I can only search for one.
I am working with a RegEx App called »RegExRX 3« to test my strings. In the end, I want to be able to use it with Adobe’s InDesign GREP functionality.
I know I could just copy and paste the correct whitespace into place but I am interested in how to do it with a Unicode character.
Thanks in advance!
InDesign uses Perl-compatible regular expressions (PCRE). To get a Unicode character into the replacement string, use \x{XXXX}, where XXXX is the hexadecimal character code:
$1\x{2009}$3\x{2009}$5
But in general you can replace by any character you can type. Just put actual thin spaces into your search-and-replace dialog:
$1 $3 $5
You can use your OS's utilities to grab the thin space from the list of available characters. On Windows it's the "Character Map" tool, where the thin space can be found in the "General Punctuation" Unicode sub-range. Searching for "thin space" works as well. macOS has the "Character Viewer", which can do the same thing.
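The same replacement can be tried outside InDesign. The sketch below uses Python's re module with the search pattern from the question; it keeps groups 1, 3 and 5 and writes U+2009 THIN SPACE where the optional whitespace groups matched. The function and constant names are illustrative only.

```python
import re

THIN_SPACE = "\u2009"  # U+2009 THIN SPACE

# Search pattern copied from the question.
DATE_RE = re.compile(r"(\d{1,2}\.)(\s?)(\d{1,2}\.)(\s?)(\d{2,4})")


def thin_space_date(text: str) -> str:
    """Keep groups 1, 3 and 5 and put a thin space where groups 2 and 4 matched."""
    replacement = r"\g<1>" + THIN_SPACE + r"\g<3>" + THIN_SPACE + r"\g<5>"
    return DATE_RE.sub(replacement, text)


result = thin_space_date("8. 3. 2014")
# Print with escapes so the otherwise invisible thin spaces show up as \u2009.
print(result.encode("unicode_escape").decode("ascii"))
```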

Why Unicode Character 'MINUS SIGN' (U+2212) is NOT in regex unicode group \p{Pd} (Dash_Punctuation)?

I'm trying to collect all dash signs so I can use them while analyzing raw text data. I found that the Unicode regexp \p{Pd} should match all cases, but it turned out that this character − doesn't match!
Here is more info about this char:
https://www.fileformat.info/info/unicode/char/2212/index.htm
Is it a bug or a feature? In practice, this behavior isn't very useful.
The Unicode character U+2212 MINUS SIGN is a math symbol (general category Sm, Math_Symbol), not a punctuation mark; for instance, it is matched by \p{Math} but not by \p{Punctuation} (which includes \p{Dash_Punctuation}).
You may want to try using \p{Dash} instead, and check whether it covers all your needs or not...
Ref: Properties for U+2212
Edit:
Here is an "official" list of all the characters having a Dash Unicode property: https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Dash=Yes:], including the U+2212 MINUS SIGN character.
In Unicode 12.0, the JavaScript regular expression:
/\p{Dash}/u
would be equivalent to:
/[\u002D\u058A\u05BE\u1400\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]/
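The difference between the two properties can be checked with Python's standard library. Python's built-in re module has no \p{...} support, so this sketch uses unicodedata for the general category and builds a character class from the code points in the list quoted above. The helper names are illustrative.

```python
import re
import unicodedata

# Code points copied from the Unicode 12.0 character class quoted above (Dash=Yes).
DASH_CODE_POINTS = [
    0x002D, 0x058A, 0x05BE, 0x1400, 0x1806, 0x2010, 0x2011, 0x2012, 0x2013,
    0x2014, 0x2015, 0x2053, 0x207B, 0x208B, 0x2212, 0x2E17, 0x2E1A, 0x2E3A,
    0x2E3B, 0x2E40, 0x301C, 0x3030, 0x30A0, 0xFE31, 0xFE32, 0xFE58, 0xFE63,
    0xFF0D,
]
DASH_RE = re.compile("[" + "".join(re.escape(chr(cp)) for cp in DASH_CODE_POINTS) + "]")


def is_dash_punctuation(ch: str) -> bool:
    """True if the character's general category is Pd (what \\p{Pd} tests)."""
    return unicodedata.category(ch) == "Pd"


for ch in ["-", "\u2013", "\u2212"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), bool(DASH_RE.search(ch)))
```

The loop prints the general category next to the result of the Dash-style class, which shows U+2212 falling into Sm while still being covered by the explicit list.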

Hive table column accept only keyboard characters, numbers and ignore control and ascii characters

Is there a regex, translate, or any other expression in Hive that keeps only keyboard characters and ignores control characters and ASCII characters in a Hive table column?
Example: regexp_replace(option_type,'[^a-zA-Z0-9]+','')
The expression above keeps only letters and numbers; when the data contains keyboard special characters such as %, &, *, ., ?, they are removed from the output.
Col: bhuvi?Where are you ?
Result: bhuviWhere are you
But I want the output to be bhuvi?Where are you?
That is, any special keyboard character should appear as is, and any control or ASCII character should be ignored.
You should consider that different keyboard layouts (languages) have different "special" characters, like German ö ä ü or Spanish Ñ (just examples, not to mention Asian, Hebrew or Arabic keyboards).
I see two solutions:
1.) Define a list of allowed characters and put them into a character class, so you can tightly control what is allowed, but you might exclude most languages.
2.) Have a look at regular expression Unicode classes: you can allow any "letter" \p{L} or "number" \p{N} and even punctuation \p{P}, and disallow only those characters you KNOW will cause problems, like control characters \p{C}.
Please see regular-expression.info for more details about Unicode regular expressions.
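As an illustration of the second option outside Hive, here is a small Python sketch. Python's built-in re module does not support \p{...}, so it uses unicodedata and drops every character whose general category starts with "C" (roughly what \p{C} covers). The function name and sample string are made up, and note that tab and newline are also category Cc and would be removed.

```python
import unicodedata


def strip_other_category(text: str) -> str:
    """Drop characters in the Unicode 'C' (Other) categories: Cc, Cf, Cs, Co, Cn."""
    return "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")


# Hypothetical sample: a NUL byte and a zero-width space hidden in the column value.
sample = "bhuvi?\x00Where\u200b are you ?"
print(strip_other_category(sample))
```

Letters such as ö or Ñ survive this filter, which is the main difference from an ASCII-only character class.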
Edit:
If you want to stick with English only and can assume you will only have ASCII to allow, you can either type every key you find on your keyboard into a character class (an incomplete example): /^[-a-zA-Z0-9,.;:_!"§$%&]+$/
or
you could use an ASCII table to determine the range of allowed characters (in your case, I assume, from "space" to "closing curly bracket" }) and let the character class allow all of them: /^[ -}]+$/
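To see what the range approach does when used for replacement rather than validation, here is a minimal Python sketch that negates the same space-to-} range and deletes everything outside it. The function name and sample values are illustrative, and the behavior of Hive's regexp_replace is assumed, not tested here.

```python
import re

# Negation of the range suggested above: anything NOT between space (0x20) and } (0x7D).
NOT_ALLOWED = re.compile(r"[^ -}]+")


def keep_ascii_range(value: str) -> str:
    """Remove control characters and anything above '}' in the ASCII table."""
    return NOT_ALLOWED.sub("", value)


print(keep_ascii_range("bhuvi?\x01Where are you ?\u00e9"))
```

One design point: the tilde (0x7E) sits just after } and is therefore removed too; widening the range to space-to-tilde would keep it.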
I found a solution. This works:
regexp_replace(option_type,'[^a-zA-Z0-9*!#+-/#$%()_=/<>?\|&]+',' ')

VBA ActiveDocument.Range.Text in Unicode

In VBA (Word specifically), I'm trying to use the RegExp object to search through a long document. Some of the patterns I search for include unicode character (such as a non-breaking hyphen or non-breaking space). When I access the text via
ActiveDocument.Range.Text
I get the text, but stripped of Unicode characters (or at least some of them, including ones that I need). For example, given the text ABC-123, where the hyphen is a non-breaking (or hard) hyphen, U+2011, accessing the text using ActiveDocument.Range.Text displays ABC123.
I thought perhaps it just displays it incorrectly, and that the character is really there, but all the search and replace I've done don't show it. Plus, when I regex the unicode character using \u2011, it doesn't find it.
Is there another way to access the document's full content, but intact with all the unicode characters?
UPDATE: I inspected the output, ABC123, and it appears that the character is there but hidden. That is, Len(str) = 7 rather than the 6 you'd expect from what is displayed. The following shows what is happening:
Print Asc(Mid(str, 4, 1))
=> 30
ASCII character 30, or \u001e is a record separator. When I search for this, it finds this zero-length character. I tested a wider range of unicode characters (\u2000-\u201f) and interestingly they all are detected with the \u control sequence in the regex, except for \u2011, which changes to \u001e. Even the en-space (\u2002) and em-space (\u2003) are recognized. I haven't done it for all the unicode characters, but it seems odd that I have stumbled upon one of the few that don't register.
This isn't an answer, but a workaround. When using RegExp to search for Unicode characters, most will be recognized in ActiveDocument.Range.Text using a \uxxxx code. If one is not, open a new Word document and add some text to the body that contains the Unicode character (e.g. a non-breaking hyphen). Then, in VBA, use the Immediate window to find the character code that Word reports for that character:
Print Asc(Mid(ActiveDocument.Range.Text, <char_position>, 1))
This will tell you whether it is actually there (even if the character doesn't show up in strings). The code you get won't work for every Unicode character, since Asc() converts some of them to ASCII characters: for example, en quad \u2000 returns ASCII 32 (space). Luckily, you can still regex \u2000 and it will find it.
For the non-breaking hyphen, the code that works with regex is \u001e.
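If the text is processed outside Word afterwards, one option is to map the reported \u001e back to U+2011 before running patterns. The Python sketch below assumes, based only on the observation in this thread, that Word hands back the non-breaking hyphen as U+001E; the names and sample string are hypothetical.

```python
import re

WORD_NB_HYPHEN = "\x1e"  # what this thread reports Range.Text returns for U+2011
NB_HYPHEN = "\u2011"     # NON-BREAKING HYPHEN


def restore_nb_hyphen(text: str) -> str:
    """Map Word's internal code back to the real non-breaking hyphen."""
    return text.replace(WORD_NB_HYPHEN, NB_HYPHEN)


sample = "ABC\x1e123"  # 7 characters, although it displays like ABC123
restored = restore_nb_hyphen(sample)
match = re.search("[A-Z]+\u2011\\d+", restored)
print(len(sample), match is not None)
```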

What is the proper regex for Active-Directory object's names?

My application creates a SharePoint site and an Active Directory group from user input. The special characters listed at http://www.webmonkey.com/reference/Special_Characters become a big problem in my application: the application creates the group names differently and then can't access them by the name property. I want to validate the user input with a regular expression for these characters. I googled and found some good regex samples and testers, but they don't solve my problem. So can anybody suggest a regex for disallowing special characters that are a problem for Active Directory object names?
P.S. Application users may enter Turkish inputs, so regex should also allow Turkish characters like 'ç', 'ş', 'ö'
You should start with something like this:
^(\p{L}|\p{N}|\p{P})+$
This will match:
\p{L}: any kind of letter from any language
\p{N}: any kind of numeric character in any script
\p{P}: any kind of punctuation character.
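A rough way to try this rule in Python: the built-in re module has no \p{...} classes, so the sketch below approximates the pattern with unicodedata by requiring every character to be in a letter (L), number (N) or punctuation (P) category. The function name is illustrative, and like the pattern above it rejects spaces and symbols such as +.

```python
import unicodedata

ALLOWED_MAJOR_CLASSES = {"L", "N", "P"}  # letters, numbers, punctuation


def is_valid_name(name: str) -> bool:
    """Approximate ^(\\p{L}|\\p{N}|\\p{P})+$ : non-empty and only L/N/P characters."""
    if not name:
        return False
    return all(unicodedata.category(ch)[0] in ALLOWED_MAJOR_CLASSES for ch in name)


for candidate in ["çşö", "Grup-1", "my group", "a+b"]:
    print(candidate, is_valid_name(candidate))
```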
When you query your AD, you must escape some special characters, as described here: Creating a Query Filter
If any of the following special characters must appear in the query filter as literals, they must be replaced by the listed escape sequence.
ASCII character    Escape sequence substitute
*                  "\2a"
(                  "\28"
)                  "\29"
\                  "\5c"
NUL                "\00"
In addition, arbitrary binary data may be represented using the escape sequence syntax by encoding each byte of binary data with the backslash followed by two hexadecimal digits. For example, the four-byte value 0x00000004 is encoded as "\00\00\00\04" in a filter string.
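The escape table above can be applied with a small helper. This is a sketch in Python, not an official API: the function name is made up, and it only covers the five characters listed in the table. It substitutes character by character so that a backslash introduced by one escape is never escaped again.

```python
# Escape sequences taken from the table above.
_FILTER_ESCAPES = {
    "*": r"\2a",
    "(": r"\28",
    ")": r"\29",
    "\\": r"\5c",
    "\x00": r"\00",
}


def escape_filter_value(value: str) -> str:
    """Replace each special character with its escape sequence, one pass only."""
    return "".join(_FILTER_ESCAPES.get(ch, ch) for ch in value)


# Hypothetical group name typed by a user.
print(escape_filter_value("Sales (EU)*"))
```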