Go regexp for printable characters - regex

I have a Go server application that processes and saves a user-specified name. I really don't care what the name is; if they want it to be in hieroglyphs or emojis that's fine, as long as most clients can display it. Based on this question for C# I was hoping to use
^[^\p{Cc}\p{Cn}\p{Cs}]{1,50}$
that is, 1 to 50 characters that are not control characters, unassigned characters, or surrogates (partial UTF-16 characters). But Go does not support Cn, and I can't find a reasonable regexp that matches any printable Unicode string but rejects "퟿͸", for example.
I want to use regex because the clients are not written in Go and I want to be able to precisely match the server validation. It's not clear how to match functions like isPrint in other languages.
Is there any way to do this other than hard-coding the unassigned unicode ranges into my application and separately checking for those?

You probably want to use just these Unicode character classes:
L (Letter)
M (Mark)
N (Number)
P (Punctuation)
S (Symbol)
That would give you this [positive] regular expression:
^[\pL\pM\pN\pP\pS]+$
Alternatively, exclude the Unicode character classes you don't want:
Z (Separator)
C (Other)
That gives this regular expression, which uses a negated character class:
^[^\pZ\pC]+$
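As a minimal sketch of how the allow-list form could be wired up in Go: the helper name validName is made up for illustration, and the {1,50} bound is borrowed from the question rather than from this answer. It uses the allow-list pattern because the question notes that Go did not support Cn, so the negated form may let unassigned code points through depending on the Go version.

```go
package main

import (
	"fmt"
	"regexp"
)

// namePattern allows 1 to 50 code points from the Letter, Mark, Number,
// Punctuation and Symbol categories. The {1,50} bound comes from the
// question; the character class is the allow-list from the answer.
var namePattern = regexp.MustCompile(`^[\pL\pM\pN\pP\pS]{1,50}$`)

// validName is a hypothetical helper name used only in this sketch.
func validName(s string) bool {
	return namePattern.MatchString(s)
}

func main() {
	samples := []string{"José", "世界", "\u0378", "a\x00b", "two words"}
	for _, s := range samples {
		fmt.Printf("%q -> %v\n", s, validName(s))
	}
}
```

Note that this pattern also rejects spaces, since Z (Separator) is not in the allow-list; add \p{Zs} to the class if names may contain spaces.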

Related

Match Unicode character with regular expression

I can use regular expressions in VBA for Word 2019:
Dim RegEx As New RegExp
Dim Matches As MatchCollection
RegEx.Pattern = "[\d\w]+"
Text = "HelloWorld"
Set Matches = RegEx.Execute(Text)
But how can I match all Unicode characters and all digits too?
\p{L} works fine for me in PHP, but this doesn't work for me in VBA for Word 2019.
I would like to find words with characters and digits. So in PHP I use for this [\p{L}\p{N}]+. Which pattern can I use for this in VBA?
Currently, I would like to match words with German characters, like äöüßÄÖÜ. But I may need this for other languages too.
But how can I match all Unicode characters and all digits too?
"VBScript Regular Expressions 5.5" (which I am pretty sure you are using here) are not "VBA Regular Expressions", they are a COM library that you can use in - among other things - VBA. They do not support Unicode with the built-in metacharacters (such as \w) and they have no knowledge of Unicode character classes (such as \p{L}). But of course you can still match Unicode characters with them.
Direct Matches
The simplest way is of course to directly use the Unicode characters you search for in the pattern. VBA uses Unicode strings, so matching Unicode is not a problem per se. Representing Unicode in your VBA source code, which itself is not Unicode, is a different matter. But ChrW() can help with that.
Assuming you have a certain character you want to match,
RegEx.Pattern = ChrW(&h4E16) & ChrW(&h754C)
Set Matches = RegEx.Execute(Text)
Msgbox Matches(0)
The above uses hex numbers (&h...) and ChrW() to create the Unicode characters U+4E16 and U+754C (世界) at run-time. When they are in your text, they will be found. This is tedious, but it works well if you already know what words you're looking for.
Ranges
If you want to match character ranges, you can do that as well. Use the start point and end point of the range. For example, the basic block of the "CJK Unified Ideographs" range goes from U+4E00 to U+9FFF:
RegEx.Pattern = "[" & ChrW(&h4E00) & "-" & ChrW(&h9FFF) & "]+"
Set Matches = RegEx.Execute(Text)
Msgbox Matches(0)
So this creates a natural range just like [a-z]+ to span all of the CJK characters. You'd have to define which ranges you want to match, so it's less convenient than having built-in support, but nothing is stopping you.
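For comparison, here is the same range written in Go, which is used only for illustration since the question is about VBA. Go's regexp package accepts \x{...} code point escapes, so no ChrW() equivalent is needed. The helper name firstCJKRun is invented for this sketch.

```go
package main

import (
	"fmt"
	"regexp"
)

// cjkRun matches one or more code points in U+4E00..U+9FFF, the same
// range the VBA pattern builds with ChrW().
var cjkRun = regexp.MustCompile(`[\x{4E00}-\x{9FFF}]+`)

// firstCJKRun is a hypothetical helper: it returns the first run of
// characters from that range, or "" if there is none.
func firstCJKRun(s string) string {
	return cjkRun.FindString(s)
}

func main() {
	fmt.Println(firstCJKRun("Hello 世界!"))
}
```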
Caveats
The above is about matching characters inside the BMP (Basic Multilingual Plane). Matching characters outside the BMP, such as emoji, is a lot more difficult: VBA strings are UTF-16, so each of these characters is stored as a surrogate pair of two code units rather than one. It's still possible, but it's not going to be pretty.
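To see why characters outside the BMP are awkward in a UTF-16 string, the following sketch prints the UTF-16 code units of one BMP character and one emoji. It is written in Go purely for illustration; utf16Units is a made-up helper name.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16Units returns the UTF-16 code units that encode s.
func utf16Units(s string) []uint16 {
	return utf16.Encode([]rune(s))
}

func main() {
	// U+4E16 is inside the BMP; U+1F600 (an emoji) is outside it.
	for _, s := range []string{"\u4e16", "\U0001F600"} {
		fmt.Printf("%s: %04X\n", s, utf16Units(s))
	}
}
```

A pattern for a non-BMP character therefore has to match two code units in sequence (a high surrogate followed by a low surrogate), which a single character-class range cannot express.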
There are multiple ways of representing the same character. For example, ä could be represented by its own, singular code point, or by a followed by a second code point for the dots (U+0308 "◌̈"). Since there is no telling how your input string represents certain characters, you should look into Unicode Normalization to make strings uniform before you search in them. In VBA this can be done by using the Win32 API.
Helpers
You can research Unicode ranges manually, but since there are so many of them, it's easy to miss some. I remember a useful helper for manually picking Unicode ranges, which now still lives on the Internet Archive: http://web.archive.org/web/20191118224127/http://kourge.net/projects/regexp-unicode-block
It allows you to quickly build regexes that span multiple ranges. It's aimed at JavaScript, but it's easy enough to adapt the output for VBA code.

Hive table column accept only key board characters,numbers and ignore control and ascii characters

Is there a regex, translate, or any other expression in Hive
that keeps only keyboard characters and ignores control characters and ASCII characters in a Hive table column?
Example: regexp_replace(option_type,'[^a-zA-Z0-9]+','')
The expression above keeps only letters and digits; any keyboard special characters in the data, such as %, &, *, . or ?, are stripped from the output.
Col: bhuvi?Where are you ?
Result: bhuviWhere are you
but i want output as bhuvi?Where are you?
In the same way, any special keyboard character that appears
should be kept as is, and any control or ASCII character should be ignored.
You should consider that different keyboard layouts (languages) have different "special" characters, like German ö ä ü or Spanish Ñ (just examples, not to mention Asian, Hebrew or Arabic keyboards).
I see two solutions:
1.) Define a list of allowed characters and put them into a character class, so you can tightly control what is allowed, but you might exclude most languages.
2.) Or have a look at regular expression Unicode classes: you can allow any "letter" \p{L} or "number" \p{N} and even punctuation \p{P}, and disallow only those characters you KNOW will cause problems, like control characters \p{Cc} (or the broader "Other" category \p{C}).
Please see regular-expressions.info for more details about Unicode regular expressions.
edit:
If you want to stick with English only and can assume you only need to allow ASCII, you can either type every key you find on your keyboard into a character class, as an incomplete example: /^[-a-zA-Z0-9,.;:_!"§$%&]+$/
or
you could use an ASCII table to determine the range of allowed characters, in your case I assume from "space" to "curly closing bracket" }, and trick the character class into allowing all of them: /^[ -}]+$/ (extend the range to ~, the last printable ASCII character, if the tilde should be allowed too).
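The range trick can be turned into a filter by negating the class and replacing whatever it matches. Below is a rough sketch in Go rather than Hive (regexp_replace in Hive takes a comparable pattern); it assumes the full printable ASCII range from space to ~ is wanted, and keepPrintableASCII is a name invented for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// notPrintableASCII matches runs of anything outside the range from
// space (0x20) to tilde (0x7E), i.e. control and non-ASCII characters.
var notPrintableASCII = regexp.MustCompile(`[^ -~]+`)

// keepPrintableASCII is a hypothetical helper that deletes those runs.
func keepPrintableASCII(s string) string {
	return notPrintableASCII.ReplaceAllString(s, "")
}

func main() {
	fmt.Printf("%q\n", keepPrintableASCII("bhuvi?Where\x01 are you ?"))
}
```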
I got the solution
regexp_replace(option_type,'[^a-zA-Z0-9*!#+-/#$%()_=/<>?\|&]+',' ') works

Regular expression for alphanumeric and underscore c#

I am working with ASP.NET MVC 5 application in which I want to add dataannotation validation for Name field.
It should accept any combination of digits, letters and underscores only.
I tried this, but it is not working:
[RegularExpression("([a-zA-Z0-9_ .&'-]+)", ErrorMessage = "Invalid.")]
Try this regex, which I put together on regexr.com.
Criteria - alphanumeric,underscore and space.
http://regexr.com/3agii
([a-zA-Z0-9_\s]+)
You are using a character class, that is, the part between the square brackets ([a-zA-Z0-9_ .&'-]). Within those square brackets you define all the characters that the class should match. So the problem is easy to see: you are allowing characters you don't want to match (the space, ., &, ' and -).
Based on your "try" you could change this to
[a-zA-Z0-9_]
which seem to be the characters you want to match. But is that really what you need? Are those really the only characters that are possible for that field?
If yes then you are done.
If no, you probably want to add all characters of all languages. Luckily there is a Unicode property for that:
\p{L} All letter characters
There is another predefined group that could be useful for you:
\w matches any word character (the definition can also be found in the first link; in .NET it includes the Unicode categories Ll, Lu, Lt, Lo, Lm, Mn, Nd and Pc, which is basically [a-zA-Z0-9_] but Unicode style, with all letters and more connecting characters)
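Not every engine gives \w this Unicode meaning. As an illustration in Go (not C#), where \w only covers [0-9A-Za-z_], the categories have to be spelled out to get a comparable Unicode-aware class. The variable and function names below are made up for the sketch.

```go
package main

import (
	"fmt"
	"regexp"
)

// unicodeWord spells out letters, non-spacing marks, decimal digits and
// connector punctuation (which includes the underscore).
var unicodeWord = regexp.MustCompile(`^[\p{L}\p{Mn}\p{Nd}\p{Pc}]+$`)

// asciiWord uses Go's \w, which is ASCII-only.
var asciiWord = regexp.MustCompile(`^\w+$`)

// isUnicodeWord and isASCIIWord are hypothetical helpers for comparison.
func isUnicodeWord(s string) bool { return unicodeWord.MatchString(s) }
func isASCIIWord(s string) bool   { return asciiWord.MatchString(s) }

func main() {
	for _, s := range []string{"name_1", "Jos\u00e9_1", "a b", "a-b"} {
		fmt.Printf("%q unicode=%v ascii=%v\n", s, isUnicodeWord(s), isASCIIWord(s))
	}
}
```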
But still, if you want to match real names this will not cover all possible names. I have another answer on this topic here

Is there a regex way to detect whether a character can be part of a word or not?

The "tricky" part of this question is that what I mean by alphabet is not just the 26 Latin letters. It should also include anything alphabet-like, including accented characters, the Hebrew alephbet, etc.
Why I need them?
I want to split texts into words.
Words in alphabets such as the Latin alphabet, the Hebrew alephbet and the Arabic abjad are separated by spaces.
Chinese characters are separated by nothing.
So I think I should split texts at anything that is not alphabetic.
In other words, a, b, c, d, é are fine.
駅,南,口,第,自,転,車.,3,5,6 are not, and every such separator should be its own word. Or something like that.
In short I want to detect whether a character may be a word by itself, or can be part of a word.
What have I tried?
Well you can check the question here I asked a long time ago:
How can we separate utf-8 characters into words if some of the characters are chinese?
I implemented the only answer there, but then I found out that the Chinese characters aren't split. Why not split on nothing? Well, that would mean the alphabets get split too.
If all those alphabets "stick" together so that I can separate them based on UTF, that would be fine too.
I will just use the answer at How can we separate utf-8 characters into words if some of the characters are chinese?
and "pull out" all non-alphabetic characters.
Not a perfect solution, but good enough for me because Western characters and Chinese characters rarely show up in the same text anyway.
Maybe you shouldn't do this with regular expressions but with good old string index scanning instead.
The Hebrew, Chinese, Korean etc. alphabets are all in consecutive ranges of Unicode code points. So you could easily detect the alphabet by reading the Unicode value of the character and then checking which Unicode block it belongs to.
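A rough sketch of that scanning idea in Go: walk the string one code point at a time, emit every Han character as its own token, and glue runs of other letters, marks and digits together. It assumes Go's unicode.Han table, which is a script rather than a block, is a good enough stand-in for "Chinese characters"; tokenize is a name invented here.

```go
package main

import (
	"fmt"
	"unicode"
)

// tokenize is a hypothetical helper. Han characters become single-rune
// tokens; runs of other letters, marks and digits form one token each;
// everything else (spaces, punctuation) only separates tokens.
func tokenize(s string) []string {
	var tokens []string
	var cur []rune
	flush := func() {
		if len(cur) > 0 {
			tokens = append(tokens, string(cur))
			cur = cur[:0]
		}
	}
	for _, r := range s {
		switch {
		case unicode.Is(unicode.Han, r):
			flush()
			tokens = append(tokens, string(r))
		case unicode.IsLetter(r) || unicode.IsMark(r) || unicode.IsDigit(r):
			cur = append(cur, r)
		default:
			flush()
		}
	}
	flush()
	return tokens
}

func main() {
	fmt.Printf("%q\n", tokenize("駅南口 caf\u00e9 35"))
}
```

Other scripts that are written without spaces (kana, Thai and so on) would need their own cases; this only covers the Han example from the question.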
Jan Goyvaerts (of PowerGrep fame) once showed me this very useful syntax to do just this:
(?<![\p{M}\p{L}])word(?![\p{M}\p{L}])
This expression uses a regex lookbehind and a regex lookahead to ensure that the boundaries of the word are such that there is no letter or diacritic mark on either side.
Why is this regex better than simply using "\b"? The strength of this regex is the incorporation of \p{M} to include diacritics. When the normal word boundary marker (\b) is used, regex engines will find word breaks at the places of many diacritics, even though the diacritics are actually part of the word (this is the case, for instance, with Hebrew diacritics. For an example, take the Hebrew word גְּבוּלוֹת, and run a regex of "\b." on it - you'll see how it actually breaks the word into different parts, at each diacritic point). The regex above fixes this by using a Unicode Character Class to ensure that diacritics are always considered part of the word and not breaks within the word.
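Go's regexp package has no lookbehind or lookahead, so the expression above cannot be used there as is, but the effect of including \p{M} can still be shown by comparing two word-extraction patterns. This is only a sketch; the function names are made up, and the Hebrew sample is written with escapes (letters plus vowel points) to keep the source ASCII-only.

```go
package main

import (
	"fmt"
	"regexp"
)

// lettersAndMarks treats diacritics as part of the word.
var lettersAndMarks = regexp.MustCompile(`[\p{L}\p{M}]+`)

// lettersOnly stops at every combining mark.
var lettersOnly = regexp.MustCompile(`\p{L}+`)

// words and fragments are hypothetical helpers for the comparison.
func words(s string) []string     { return lettersAndMarks.FindAllString(s, -1) }
func fragments(s string) []string { return lettersOnly.FindAllString(s, -1) }

func main() {
	// A Hebrew word with vowel points: letters interleaved with marks.
	hebrew := "\u05d2\u05b0\u05bc\u05d1\u05d5\u05bc\u05dc\u05d5\u05b9\u05ea"
	fmt.Println(len(words(hebrew)), len(fragments(hebrew)))
}
```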

internationalized regular expression in postgresql

How can I write regular expressions to match names like 'José' in Postgres? In other words, I need to set up a constraint to check that only valid names are entered, but I also want to allow Unicode characters.
Regular expressions, Unicode style has some reference on this, but it seems I can't write it in Postgres.
If it is not possible to write a regex for this, will it be sufficient to check only on the client side using JavaScript?
PostgreSQL doesn't support character classes based on the Unicode Character Database like .NET does. You get the more-standard [[:alpha:]] character class, but this is locale-dependent and probably won't cover it.
You may be able to get away with just blacklisting the ASCII characters you don't want, and allowing all non-ASCII characters, e.g. something like
[^\s!"#$%&'()*+,\-./:;<=>?\[\\\]^_`~]+
(At the time of writing, JavaScript doesn't have non-ASCII character classes either, or even [[:alpha:]]; newer engines do support \p{...} property escapes with the u flag.)
For example, given v_text as a text variable to be sanitized:
-- Allow internationalized text characters and remove undesired characters
v_text = regexp_replace( lower(trim(v_text)), '[!"#$%&()*+,./:;<=>?\[\\\]\^_\|~]+', '', 'g' );
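If the same rule has to be checked outside the database, the blacklist approach carries over to other engines. Here is a minimal sketch in Go, assuming the ASCII blacklist from the first answer (whitespace plus the listed punctuation) and anchoring it so the whole string must match; validASCIIBlacklist is a name invented for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// blacklist rejects whitespace and the listed ASCII punctuation and
// accepts everything else, including all non-ASCII characters.
// An interpreted string is used because the class contains a backtick.
var blacklist = regexp.MustCompile("^[^\\s!\"#$%&'()*+,\\-./:;<=>?\\[\\\\\\]^_`~]+$")

// validASCIIBlacklist is a hypothetical helper name for this sketch.
func validASCIIBlacklist(s string) bool {
	return blacklist.MatchString(s)
}

func main() {
	for _, s := range []string{"Jos\u00e9", "Jos;e", "Anne Marie"} {
		fmt.Printf("%q -> %v\n", s, validASCIIBlacklist(s))
	}
}
```

Because everything outside ASCII is allowed, this also lets through control-like and unassigned non-ASCII code points, so it is a loose check rather than a strict one.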