Allow only letters and digits in strings but without confusables - regex

Say I want usernames to only consist of letters and digits regardless of language.
I think I might accomplish this with the following regex parts
(?>\p{L}[\p{Mn}\p{Mc}]*) //match any letter, including those consisting of two code points
\p{Nd} //match any digit
The problem is that users may impersonate other users by choosing a username that looks identical to an existing one (a homograph attack). An example would be admin versus аdmin, where the second name starts with the Cyrillic letter а (U+0430) instead of the Latin a.
I guess it's not easy to exclude characters that are both letters and confusables using a regex alone, but how about outside of regexes? Do the Unicode code points of confusables lie in certain ranges that we could filter, or something like that?

Confusables... I assume you are talking about Cyrillic characters. If that's right, you can easily exclude them in your regex. Consider the following ranges:
Cyrillic: U+0400–U+04FF, 256 characters
Cyrillic Supplement: U+0500–U+052F, 48 characters
Cyrillic Extended-A: U+2DE0–U+2DFF, 32 characters
Cyrillic Extended-B: U+A640–U+A69F, 96 characters
Phonetic Extensions: U+1D2B, U+1D78, 2 Cyrillic characters
Then:
/[^\x{0400}-\x{04FF}\x{0500}-\x{052F}\x{2DE0}-\x{2DFF}\x{A640}-\x{A69F}\x{1D2B}\x{1D78}]/u
Or simply by using [^\p{Cyrillic}]
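As a rough illustration, the same ranges can be checked in Python. The standard re module has no \p{Cyrillic}, so the code points from the list above are spelled out; contains_cyrillic is a hypothetical helper name, not part of any library.

```python
import re

# Code point ranges copied from the list above: the Cyrillic blocks plus
# the two Cyrillic letters in Phonetic Extensions.
CYRILLIC = re.compile(
    "[\u0400-\u04FF\u0500-\u052F\u2DE0-\u2DFF\uA640-\uA69F\u1D2B\u1D78]"
)

def contains_cyrillic(text: str) -> bool:
    """Return True if any character falls in the Cyrillic ranges above."""
    return CYRILLIC.search(text) is not None
```

Note that this only catches Cyrillic lookalikes; confusables from other scripts (Greek ο versus Latin o, for instance) pass through unchanged.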

The Unicode standard includes a list of confusable characters at http://www.unicode.org/Public/security/revision-02/confusables.txt
This list is incomplete according to some, and too aggressive according to others, but take a look at it in order to understand how difficult the problem is to solve generically.
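Short of processing the full confusables list, one cheap heuristic is to reject usernames that mix scripts. The sketch below is an assumption of mine, not something the Unicode list prescribes: it approximates a character's script by the first word of its Unicode name, which is only a rough stand-in for the real Script property. Both function names are hypothetical.

```python
import unicodedata

def scripts_used(name: str) -> set:
    """Approximate the set of scripts used by the letters in `name`."""
    scripts = set()
    for ch in name:
        if unicodedata.category(ch).startswith("L"):
            # e.g. "LATIN SMALL LETTER A" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_mixed_script(name: str) -> bool:
    """Return True if the letters in `name` come from more than one script."""
    return len(scripts_used(name)) > 1
```

This flags a Latin name with a single Cyrillic letter, but it does not help when the whole name is written in one lookalike script, and it would also flag legitimate mixes such as Japanese kana with kanji.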

Related

Hive table column accept only key board characters,numbers and ignore control and ascii characters

Is there any regex, translate, or other expression in Hive that
keeps only keyboard characters and ignores control characters and ASCII characters in a Hive table column?
Example: regexp_replace(option_type,'[^a-zA-Z0-9]+','')
The expression above keeps only letters and digits, so when the data contains keyboard special characters such as %, &, *, ., or ?, those are stripped from the output as well.
Col: bhuvi?Where are you ?
Result: bhuviWhere are you
But I want the output to be bhuvi?Where are you?
That is, any special keyboard character
should be kept as is, and any control or ASCII character should be removed.
You should consider that different keyboard layouts (languages) have different "special" characters, such as German ö ä ü or Spanish Ñ (just examples, not to mention Asian, Hebrew or Arabic keyboards).
I see two solutions:
1.) Define a list of allowed characters and put them into a character class. This gives you tight control over what is allowed, but you might exclude most languages.
2.) Have a look at Unicode character classes in regular expressions: you can allow any "letter" \p{L}, "number" \p{N} and even punctuation \p{P}, and disallow only those characters you KNOW will cause problems, such as control characters \p{C}.
Please see regular-expression.info for more details about Unicode regular expressions.
edit:
If you want to stick with English only and can assume you will only need to allow ASCII, you can either type every key you find on your keyboard into a character class (an incomplete example): /^[-a-zA-Z0-9,.-;:_!"§$%&]+$/
or
you could use an ASCII table to determine the range of allowed characters, in your case I assume from "space" to "closing curly bracket" }, and let a single character class range cover all of them: /^[ -}]+$/
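To see the range trick in action outside Hive, here is a small Python sketch. It is an illustration only: the function name is hypothetical, and I extend the range to the tilde (0x7E) so that every printable ASCII character is kept, which is my assumption rather than part of the answer above.

```python
import re

# Anything outside printable ASCII (space 0x20 through tilde 0x7E).
NON_PRINTABLE_ASCII = re.compile(r"[^ -~]+")

def keep_keyboard_chars(value: str) -> str:
    """Drop control characters and non-ASCII characters, keep the rest."""
    return NON_PRINTABLE_ASCII.sub("", value)
```

The same negated class, '[^ -~]+', should be usable as the pattern argument of regexp_replace, since the range syntax is ordinary character class syntax.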
I got the solution
regexp_replace(option_type,'[^a-zA-Z0-9*!#+-/#$%()_=/<>?\|&]+',' ') works

Generic regex umlaut solution?

Is there a generic (non-)word regex that covers all mutations of characters on this globe? I am developing an application that should handle all languages.
Technically I want to split sentences into words. Splitting on non-word characters (\W) also splits on 'ä'. A static workaround is not an option, since explicitly covering all such variants (éçḮñ and thousands more) is impossible.
I can't give you something that will work on all languages because I don't know enough languages to judge whether there will be edge cases.
My suggestion:
1.) Split on whitespace (\s+).
2.) Trim punctuation characters from the start/end of each "word" you got in step 1 (replace ^\p{P}+|\p{P}+$ with nothing - the QRegularExpression docs say that it supports Unicode fully, so there's hope this will work)
Unless you care about preserving punctuation in examples like This is Charles' car, this should go a long way without removing punctuation within words like it's or Marne-sur-Seine.
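The two steps can be sketched in Python as follows. Python's re module has no \p{P}, so this version uses unicodedata to test for the punctuation categories instead; the function names are hypothetical and this is only one way to approximate the suggestion.

```python
import unicodedata

def is_punct(ch: str) -> bool:
    # General categories Pc, Pd, Ps, Pe, Pi, Pf, Po all start with "P".
    return unicodedata.category(ch).startswith("P")

def split_words(sentence: str) -> list:
    words = []
    for token in sentence.split():           # step 1: split on whitespace
        start, end = 0, len(token)
        while start < end and is_punct(token[start]):
            start += 1                        # step 2: trim leading punctuation
        while end > start and is_punct(token[end - 1]):
            end -= 1                          # ... and trailing punctuation
        if start < end:                       # skip tokens that were all punctuation
            words.append(token[start:end])
    return words
```

For the sentence "This is Charles' car, it's Marne-sur-Seine." the apostrophe after Charles is trimmed, while the apostrophe in it's and the hyphens in Marne-sur-Seine survive.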

How to verify form input using HTML5 input verification

I have tried finding a full list of patterns to use for verifying input via HTML5 form verification for various types, specifically url, email, tel and such, but I couldn't find any. Currently, the built-in versions of these input verifications are far from perfect (and tel doesn't even check if the thing you're entering is a phone number). So I was wondering, which patterns could I use for verifying the user is entering the right format in the inputs?
Here are a few examples of cases where the default verification allows input that is not supposed to be allowed:
type="email"
This field allows emails that have incorrect domains after the @, and it allows addresses to start or end with a dash or period, which isn't allowed either. So, .example-@x is allowed.
type="url"
This input basically allows any input that starts with http:// (Chrome) and is followed by anything other than a few special characters such as those that have a function in URLs (\, @, #, ~, etc). In FF, all that's checked is if it starts with http:, followed by anything other than : (even just http: is allowed in FF). IE does the same as FF, except that it doesn't disallow http::.
For example: http://. is allowed in all three. And so is http://,.
type="tel"
There currently is no built-in verification for phone numbers in any of the major browsers (it functions 100% the same as a type="text", other than telling mobile browsers which kind of keyboard to display).
So, since the browsers don't show a consistent behaviour in each of these cases, and since the behaviour they do show is very basic with many false positives, what can I do to verify my HTML forms (still using HTML5 input verification)?
PS: I'm posting this because I would find it useful to have a complete list of form verification patterns myself, so I figured it might be useful for others too (and of course others can post their solutions too).
These patterns aren't necessarily simple, but here's what I think works best in every situation. Keep in mind that (quite recently) Internationalized Domain Names (IDNs) are available too. With that, an un-testable amount of characters are allowed in URLs (there still exist lots of characters that aren't allowed in domain names, but the list of allowed characters is so big, and will change so often for different Top-Level Domains, that it's not practical to keep up with them). If you want to support the internationalized domain names, you should use the second URL pattern, otherwise, use the first.
##TL;DR:
Here's a live demo to see the following patterns in action. Scroll down for an explanation, reasoning and analysis of these patterns.
URLs
https?:\/\/(?![^\/]{253}[^\/])((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(\/.*)?
https?:\/\/(?!.{253}.+$)((?!-.*|.*-\.)([^ !-,\.\/:-@\[-`{-~]{1,63}\.)+([^ !-\/:-@\[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))(\/.*)?
Emails
(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)([a-zA-Z0-9!#$%&'*+\/=?^_`{|}~.-]+@)(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}
Phone numbers
((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}
((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}
Western-style names
([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}
##URLs, without IDN support
https?:\/\/(?![^\/]{253}[^\/])((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(\/.*)?
Explanation:
DNSes
URLs should always start with http:// or https://, since we don't want links to other protocols.
Domain names should not start or end with -
Each label of a domain name (the part between two dots) can be at most 63 characters long, and the total length (including dots) cannot exceed 253 characters (some sources say 255; 253 is the safe choice) [1].
Non-IDNs can only contain the letters of the Latin alphabet, the digits 0 through 9, and a dash.
Top-level domains of non-IDNs contain only letters of the Latin alphabet, and at least two of them [2].
I've set an arbitrary limit of 15 letters, since there are currently no domains that exceed 13 characters (".international"), which most likely won't change any time soon.
IPs
Special cases such as 0.0.0.0, 127.0.0.1, etc. are not checked for
IPs that have padded zeroes in them are not allowed (for example 01.1.1.1) [4].
IP numbers can only go from 0 through 255. 256 is not allowed.
Note that the default http:.* pattern built into modern browsers will always be enforced, so even if you remove the https?:// at the start in this pattern, it will still be enforced. Use type="text" to avoid it.
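The IP rules above can be tried in isolation. The sketch below lifts the octet sub-pattern from the non-IDN URL pattern and wraps it in Python; the only changes are non-capturing groups and the hypothetical is_ipv4 helper.

```python
import re

# One octet: 100-199, or 0-99 without a padded zero, or 200-255.
OCTET = r"(?:1[0-9]{2}|[1-9]?[0-9]|2(?:[0-4][0-9]|5[0-5]))"
IPV4 = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")

def is_ipv4(host: str) -> bool:
    """Return True if `host` is a dotted quad accepted by the pattern."""
    return IPV4.fullmatch(host) is not None
```

fullmatch is used because the HTML pattern attribute also requires the whole value to match.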
##URLs, with IDN support
https?:\/\/(?!.{253}.+$)((?!-.*|.*-\.)([^ !-,\.\/:-@\[-`{-~]{1,63}\.)+([^ !-\/:-@\[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))(\/.*)?
Explanation:
Since there is a huge number of characters that are allowed in IDNs, it's not practical to list every possible combination in an HTML attribute (you'd get a huge pattern, so in that case it's much better to test it by some method other than regex) [5].
Disallowed characters in domain names are: !"#$%&'()*+, ./ :;<=>?@ [\]^_` {|}~ with the exception of a period as domain separator.
These are matched in the ranges [!-,] [\.\/] [:-@] [\[-`] [{-~].
All other characters are allowed in this input field
TLDs are allowed to have the same letters in them, up to an arbitrary limit of 15 characters (like with the non-IDN URLs).
Alternatively, TLDs can be of the format xn--* with * being an encoded version of the actual TLD. This encoding uses 2 Latin letters or Arabic numerals per original character, so the arbitrary limit here is doubled to 30.
##Email addresses
(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)([a-zA-Z0-9!#$%&'*+\/=?^_`{|}~.-]+@)(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}
Explanation:
No pattern this short can be 100% foolproof for email addresses, but this one covers nearly all of them. A 100% complete pattern does exist, but contains PCRE (PHP)-only regex lookaheads, so it won't work in HTML forms.
Email addresses can only contain letters of the Latin alphabet, the numbers 0-9, and the characters in !#$%&'*+\/=?^_`{|}~.- [6].
Accents are not universally supported [7], but if needed, post a comment, and I could perhaps write a version that meets the RFC 6530 standard.
The local part (before the @) can only be 64 characters long, and the total address can only be 254 characters long [8].
Addresses may not start or end with a - or ., and no two dots may appear consecutively [8].
The domain may not be an IP address [9].
Other than that, I only included the non-IDN part of the pattern. IDNs are allowed too though, so those will result in false negatives.
##Phone numbers
((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}
((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}
Explanation:
Phone numbers must start with one of the following, where [CTRY] stands for the country code, and X stands for the first non-zero digit (such as 6 in mobile numbers),
00[CTRY]X
+[CTRY]X
0X
[CTRY]X (This is not officially correct syntax, but Chrome Autofill seems to like it for some reason.)
Spaces are allowed between the digits (see the second pattern for the space-less version), except before the non-zero X as defined above.
Phone numbers must be exactly 9 digits long, other than the part before the first non-zero X as defined above.
This regex is just for 10-digit phone numbers. Since phone number lengths may vary between countries, it's best to use a less strict version of this pattern, or modify it to work for the desired countries. So, this pattern should generally be used as a kind of template pattern.
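To check how the two patterns behave, they can be run through Python's re module. This is a small sketch: the is_phone helper is hypothetical, and the sample numbers in the note below are made up for illustration.

```python
import re

# The two patterns from above, unchanged.
PHONE_WITH_SPACES = re.compile(r"((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}")
PHONE_NO_SPACES = re.compile(r"((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}")

def is_phone(number: str, allow_spaces: bool = True) -> bool:
    """Match the whole string, as the HTML pattern attribute does."""
    pattern = PHONE_WITH_SPACES if allow_spaces else PHONE_NO_SPACES
    return pattern.fullmatch(number) is not None
```

Under these patterns 0612345678, +31612345678 and 06 12 34 56 78 are accepted, while 061234567 (one digit short) is rejected.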
##Extra: Western-style names
([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}
Yes, I know, I'm very western-centric, but this may be useful too, since such a pattern can be difficult to write, and if you're making a site for a western audience, this will always work (Asian names have a representation in exactly this format too).
All names must start with an uppercase letter
Uppercase letters may occur in the middle of names (such as John McDoe)
Names must be at least 2 letters long
I've set an arbitrary maximum of 10 names (these people probably won't mind), each of which can be at most 20 letters long (the length of "Werbenjagermanjensen", who happens to be #1).
Latin and Greek letters are allowed, including all accented Latin and Greek letters (list) and Icelandic letters (ÐÞ ðþ):
A-Z matches all uppercase Latin letters: ABCDEFGHIJKLMNOPQRSTUVWXYZ
Ά-Ϋ matches all uppercase Greek letters, including the accented ones: Ά·ΈΉΊ΋Ό΍ΎΏΐ ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡ΢ΣΤΥΦΧΨΩ ΪΫ.
À-ÖØ-Þ matches all uppercase accented Latin letters, and the Ð and Þ: ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ. In between there's also the character × (between Ö and Ø), which is left out this way.
a-z matches all lowercase Latin letters: abcdefghijklmnopqrstuvwxyz
ά-ώ matches all lowercase Greek letters, including the accented ones: άέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
ß-öø-ÿ matches all lowercase accented Latin letters, and the ß, ð and þ: ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ. In between there's also the character ÷ (between ö and ø), which is left out this way.
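A quick way to try the name pattern is shown below, using Python as a stand-in for the browser's regex engine (an assumption on my part; the character ranges behave the same way in both). The is_western_name helper is hypothetical.

```python
import re

# The name pattern from above, unchanged.
NAME = re.compile(r"([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}")

def is_western_name(value: str) -> bool:
    """Match the whole string, as the HTML pattern attribute does."""
    return NAME.fullmatch(value) is not None
```

The ranges only cover precomposed letters, so a name typed with combining accents (e followed by U+0301 rather than é) would be rejected unless it is normalized to NFC first.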
##References
[1] https://en.wikipedia.org/wiki/Domain_Name_System#Domain_name_syntax → https://www.rfc-editor.org/rfc/rfc1034#section-3.1
[2] https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains / https://www.icann.org/resources/pages/tlds-2012-02-25-en
[3] https://en.wikipedia.org/wiki/Domain_name#Technical_requirements_and_process / What are the allowed characters in a subdomain?
[4] Based on the fact neither browsers nor the Windows cmd line allow the padded format.
[5] What are the allowed characters in a subdomain? → http://www.domainnameshop.com/faq.cgi?id=8&session=106ee5e67d523298
[6] https://en.wikipedia.org/wiki/Email_address#Local_part / What characters are allowed in an email address?
[7] https://en.wikipedia.org/wiki/Email_address#Internationalization
[8] https://en.wikipedia.org/wiki/Email_address#Syntax → https://www.rfc-editor.org/rfc/rfc5321#section-4.5.3.1
[9] Sending Email using IP Address instead of Domain Name

Is there a regex way to detect whether a character can be part of a word or not?

The "tricky" part of this question is that what I mean by alphabet is not just the 26 characters. It should also include anything alphabet-like, including accented characters, the Hebrew alephbet, etc.
Why I need them?
I want to split texts into words.
Scripts like the Latin alphabet, the Hebrew alephbet and the Arabic abjad separate words with spaces.
Chinese characters are not separated by anything.
So I think I should split texts on anything that's not alphabetic.
In other words, a, b, c, d, é are fine.
駅,南,口,第,自,転,車.,3,5,6 are not, and every such separator should be its own word. Or something like that.
In short, I want to detect whether a character may be a word by itself or can be part of a word.
What have I tried?
Well you can check the question here I asked a long time ago:
How can we separate utf-8 characters into words if some of the characters are chinese?
I implemented the only answer there, but then I found out that the Chinese characters aren't split. Why not split on nothing? Well, that would split the alphabetic words too.
If all those alphabets "stick" together so that I can separate them based on their code points, that would be fine too.
I will just use the answer at How can we separate utf-8 characters into words if some of the characters are chinese?
and "pull out" all non-alphabetic characters.
Not a perfect solution, but good enough for me, because western characters and Chinese characters rarely show up in the same text anyway.
Maybe you shouldn't do this with regular expressions but with good old string index scanning instead.
The Hebrew, Chinese, Korean etc. scripts each occupy consecutive ranges of Unicode code points. So you could easily detect the script by reading the code point of the character and checking which Unicode block it belongs to.
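A minimal sketch of such a scan in Python might look like the following. The block list is deliberately short and is my own selection, and tokenize is a hypothetical name; letters outside the listed blocks are grouped into words, and each character inside them becomes its own token.

```python
# Unicode block ranges (inclusive). Only a few blocks are listed here;
# extend the list for other scripts written without spaces.
CJK_BLOCKS = [
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]

def is_cjk(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_BLOCKS)

def tokenize(text: str) -> list:
    tokens, current = [], []
    for ch in text:
        if ch.isalpha() and not is_cjk(ch):
            current.append(ch)         # part of a space-separated word
            continue
        if current:                    # any other character ends the word
            tokens.append("".join(current))
            current = []
        if is_cjk(ch):
            tokens.append(ch)          # each CJK character is its own token
    if current:
        tokens.append("".join(current))
    return tokens
```

Digits and punctuation are simply dropped in this version, and combining marks end a word, so it would need extending before it handles pointed Hebrew or decomposed accents.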
Jan Goyvaerts (of PowerGrep fame) once showed me this very useful syntax to do just this:
(?<![\p{M}\p{L}])word(?![\p{M}\p{L}])
This expression uses a regex lookbehind and a regex lookahead to ensure that the boundaries of the word are such that there is no letter or diacritic mark on either side.
Why is this regex better than simply using "\b"? The strength of this regex is the incorporation of \p{M} to include diacritics. When the normal word boundary marker (\b) is used, regex engines will find word breaks at the places of many diacritics, even though the diacritics are actually part of the word (this is the case, for instance, with Hebrew diacritics. For an example, take the Hebrew word גְּבוּלוֹת, and run a regex of "\b." on it - you'll see how it actually breaks the word into different parts, at each diacritic point). The regex above fixes this by using a Unicode Character Class to ensure that diacritics are always considered part of the word and not breaks within the word.
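Python's built-in re module does not support \p{M} or \p{L}, so the sketch below reproduces the same boundary test by hand with unicodedata. It is an approximation of the idea rather than the regex itself, and find_whole_word is a hypothetical name.

```python
import unicodedata

def is_letter_or_mark(ch: str) -> bool:
    # General categories starting with "L" (letters) or "M" (marks),
    # the same classes that \p{L} and \p{M} refer to.
    return unicodedata.category(ch)[0] in "LM"

def find_whole_word(text: str, word: str) -> list:
    """Return start indexes where `word` has no letter or mark on either side."""
    hits = []
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        before_ok = start == 0 or not is_letter_or_mark(text[start - 1])
        after_ok = end == len(text) or not is_letter_or_mark(text[end])
        if before_ok and after_ok:
            hits.append(start)
        start = text.find(word, start + 1)
    return hits
```

Searching for cafe in a text that contains cafe followed by a combining acute accent (U+0301) skips that occurrence, because the accent counts as part of the word.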

Regex match, quite simple:

I'm looking to match Twitter syntax with a regex.
How can I match anything that is "@______", that is, begins with an @ symbol and is followed by no spaces, just letters and numbers, until the end of the word? (To tweeters: I want to match someone's name in a reply.)
Go for
/@(\w+)/
to get the matching name extracted as well.
@\w+
That simple?
It should be noted that Twitter no longer allows usernames longer than 15 characters, so you can also match with:
@\w{1,15}
There are still apparently a few people with usernames longer than 15 characters, but testing on 15 would be better if you want to exclude likely false positives.
There are apparently no rules regarding whether underscores can be used at the beginning or end of usernames, multiple underscores, etc., and there are accounts with single-letter names, as well as someone with the username "_".
@[\d\w]+
\d for a digit character
\w for a word character
[] to denote a character class
+ to represent one or more instances of the character class
Note that these specifiers for word and digit characters are language dependent. Check the language specification to be sure.
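As an example of that language dependence: in Python 3, \w matches Unicode letters and digits by default, and the re.ASCII flag narrows it to [a-zA-Z0-9_]. The sketch below assumes mentions are written with a leading @ and applies the 15-character limit mentioned earlier; find_mentions is a hypothetical helper.

```python
import re

# \b after the name rejects names longer than 15 characters instead of
# silently truncating them; re.ASCII restricts \w to [a-zA-Z0-9_].
MENTION = re.compile(r"@(\w{1,15})\b", re.ASCII)

def find_mentions(tweet: str) -> list:
    """Return the usernames mentioned in `tweet`, without the leading @."""
    return MENTION.findall(tweet)
```

This simple version would also pick up the part after the @ in an email address, so real input may need an extra check on the character before the @.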
There is a very extensive API for how to get valid twitter names, mentions, etc. The Java version of the API provided by Twitter can be found on github twitter-text-java. You may want to take a look at it to see if this is something you can use.
I have used it to validate Twitter names and it works very well.