Ruby - split on nonalphanumeric characters excluding international characters? - regex

This is my regex so far, which will split on non-alphanumeric characters, including international characters (i.e. Korean, Japanese, Chinese characters).
title = '[MV] SUNMI(선미) _ 누아르(Noir)'
title.split(/[^a-zA-Z0-9 ']/)
This is the regex to match any international character:
[^\x00-\x7F]+
Which I got from: Regular expression to match non-English characters? Let's assume this is 100% correct (no debating!)
How do I combine these 2 so I can split on non-alphanumeric characters, excluding international characters? The easy part is done. I just need to combine these regexes somehow.
My expected output would be something like this
["MV", "SUNMI", "선미", "누아르", "Noir"]
TLDR: I want to split on non-alphanumeric characters only (English letters and foreign characters should not be split on)

(?:[^a-zA-Z0-9](?<![^\x00-\x7F]))+
https://regex101.com/r/EDyluc/1
What is not matched (remains from split) is what you want to keep.
Explained:
(?:
[^a-zA-Z0-9] # Not Ascii AlphaNum
(?<! [^\x00-\x7F] ) # Behind, not not Ascii range (Ascii boundary)
)+
Let me know if you need a more detailed explanation.
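The question is about Ruby, but the same pattern can be tried in JavaScript as a rough sketch. It assumes an engine with lookbehind support, and splitAsciiPunct is a hypothetical helper name, not something from the answer:

```javascript
// Split on runs of ASCII non-alphanumerics; non-ASCII characters are kept.
// The lookbehind rejects a just-consumed character that lies outside \x00-\x7F.
const ASCII_SEPARATORS = /(?:[^a-zA-Z0-9](?<![^\x00-\x7F]))+/;

// Hypothetical helper name, used only for illustration.
function splitAsciiPunct(text) {
  // split() leaves empty strings where the input starts or ends with a separator.
  return text.split(ASCII_SEPARATORS).filter((part) => part.length > 0);
}

const parts = splitAsciiPunct("[MV] SUNMI(선미) _ 누아르(Noir)");
```

The filter step matters here because the sample title both starts and ends with a separator.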

So basically you want to split on all ASCII characters that are not letters. You can use this regex, which selects those characters within the ASCII range:
[ -@[-`{-~]+
The class has three ranges: space to @, which stops just before the uppercase letters; [ to backtick, which covers the characters between the uppercase and lowercase letters; and { to ~, which covers the characters after the lowercase letters, as can be seen in an ASCII table. Note that the first range also contains the digits 0-9, so digits act as separators too.
In case you want to exclude everything up to the extended ASCII characters, you can replace ~ in the regex with ÿ and use the [ -@[-`{-ÿ]+ regex.
Demo
Check out this Ruby code:
s = '[MV] SUNMI(선미) _ 누아르(Noir)'
puts s.split(/[ -@\[-`{-~]+/).reject(&:empty?) # drop the empty string left by the leading "["
Prints,
MV
SUNMI
선미
누아르
Noir
Online Ruby Demo
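A JavaScript sketch of the range-based split, assuming the first range is meant to run from space (0x20) to @ (0x40). Since that range includes the digits, a second variant that skips 0x30-0x39 is shown as well; splitOn and both constant names are made up for this example:

```javascript
// Separators: every printable ASCII character that is not a letter.
// Assumption: the first range runs from space (0x20) to @ (0x40), which includes 0-9.
const NON_LETTERS = /[ -@\[-`{-~]+/;

// Variant that keeps digits by skipping 0x30-0x39: space to /, then : to @.
const NON_ALPHANUMERICS = /[ -\/:-@\[-`{-~]+/;

// Hypothetical helper: split and drop the empty pieces at the edges.
function splitOn(pattern, text) {
  return text.split(pattern).filter((part) => part.length > 0);
}

const words = splitOn(NON_LETTERS, "[MV] SUNMI(선미) _ 누아르(Noir)");
const digitsSplit = splitOn(NON_LETTERS, "Track01 (선미)");
const digitsKept = splitOn(NON_ALPHANUMERICS, "Track01 (선미)");
```

With the first pattern "Track01" loses its digits; the second keeps it as one piece.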

Related

Regex allow only Uppercase Extended ASCII

I need a regex that allows only uppercase Extended ASCII characters, up to a maxLength that I set beforehand as the maximum length of the word.
Regex for uppercase letters: \P{Ll}*
Regex for extended ASCII letters: [\x00-\xFF]*
Using ^[\p{Ll}] is not enough, because the characters also need to be Extended ASCII (so that emoji and other special characters outside the Extended ASCII range are not allowed).
How can I combine these 2 requirements, together with the maxLength limit?
Thank you!!
Generally, you can use
^(?:(?=\p{Lu})\p{Latin}){1,10}$
See the regex demo. Details:
^ - start of string
(?: - start of a non-capturing group:
(?=\p{Lu})\p{Latin} - a char from the Latin script (a Unicode script, not a general category) that is an uppercase letter
){1,10} - end of the group, repeat one to ten occurrences
$ - end of string.
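For engines that follow the JavaScript syntax, the pattern above can be sketched as follows. JavaScript spells the script test as \p{Script=Latin} and needs the u flag; uppercaseLatinPattern is a hypothetical helper that takes the maxLength from the question as a parameter:

```javascript
// Builds a pattern for 1..maxLength uppercase letters from the Latin script.
// Assumes an engine with Unicode property escapes (the `u` flag, ES2018+).
// Hypothetical helper name; maxLength is expected to be a positive integer.
function uppercaseLatinPattern(maxLength) {
  return new RegExp(`^(?:(?=\\p{Lu})\\p{Script=Latin}){1,${maxLength}}$`, "u");
}

const upTo10 = uppercaseLatinPattern(10);
const okPlain = upTo10.test("ABC");
const tooLong = upTo10.test("ABCDEFGHIJK"); // 11 letters
const lowercase = upTo10.test("abc");
```

Note that \p{Script=Latin} is wider than Extended ASCII, so a masked-input component that needs a fixed character list still requires the enumeration approach.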
Since you are using the regex in a DevExpress masked input component you need to enumerate all these letters in a character class. Based on Regex Latin characters filter and non latin character filer, you need
Latin-1 Supplement U+0080 - U+00FF
Latin Extended-A U+0100 - U+017F
Latin Extended-B U+0180 - U+024F
All chars that are uppercase letters in these three ranges are the ones you want to allow:
var res = [];
for (var i = 128; i <= 591; i++) { // Get chars from \u0080 to \u024F
  if (/^\p{Lu}$/u.test(String.fromCharCode(i))) { // If it is an uppercase letter
    res.push(String.fromCharCode(i)); // Add it to the results
  }
}
console.log(res.join(""));
The code will look like
settings.MaskExpression = "[\\u00C0-\\u00D6\\u00D8-\\u00DE\\u0100\\u0102\\u0104\\u0106\\u0108\\u010A\\u010C\\u010E\\u0110\\u0112\\u0114\\u0116\\u0118\\u011A\\u011C\\u011E\\u0120\\u0122\\u0124\\u0126\\u0128\\u012A\\u012C\\u012E\\u0130\\u0132\\u0134\\u0136\\u0139\\u013B\\u013D\\u013F\\u0141\\u0143\\u0145\\u0147\\u014A\\u014C\\u014E\\u0150\\u0152\\u0154\\u0156\\u0158\\u015A\\u015C\\u015E\\u0160\\u0162\\u0164\\u0166\\u0168\\u016A\\u016C\\u016E\\u0170\\u0172\\u0174\\u0176\\u0178\\u0179\\u017B\\u017D\\u0181\\u0182\\u0184\\u0186\\u0187\\u0189-\\u018B\\u018E-\\u0191\\u0193\\u0194\\u0196-\\u0198\\u019C\\u019D\\u019F\\u01A0\\u01A2\\u01A4\\u01A6\\u01A7\\u01A9\\u01AC\\u01AE\\u01AF\\u01B1-\\u01B3\\u01B5\\u01B7\\u01B8\\u01BC\\u01C4\\u01C7\\u01CA\\u01CD\\u01CF\\u01D1\\u01D3\\u01D5\\u01D7\\u01D9\\u01DB\\u01DE\\u01E0\\u01E2\\u01E4\\u01E6\\u01E8\\u01EA\\u01EC\\u01EE\\u01F1\\u01F4\\u01F6-\\u01F8\\u01FA\\u01FC\\u01FE\\u0200\\u0202\\u0204\\u0206\\u0208\\u020A\\u020C\\u020E\\u0210\\u0212\\u0214\\u0216\\u0218\\u021A\\u021C\\u021E\\u0220\\u0222\\u0224\\u0226\\u0228\\u022A\\u022C\\u022E\\u0230\\u0232\\u023A\\u023B\\u023D\\u023E\\u0241\\u0243-\\u0246\\u0248\\u024A\\u024C\\u024E]{1,10}";
The \u... part matches any letters from the ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİĲĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸŹŻŽƁƂƄƆƇƉƊƋƎƏƐƑƓƔƖƗƘƜƝƟƠƢƤƦƧƩƬƮƯƱƲƳƵƷƸƼǄǇǊǍǏǑǓǕǗǙǛǞǠǢǤǦǨǪǬǮǱǴǶǷǸǺǼǾȀȂȄȆȈȊȌȎȐȒȔȖȘȚȜȞȠȢȤȦȨȪȬȮȰȲȺȻȽȾɁɃɄɅɆɈɊɌɎ set.
The {1,10} limiting quantifier matches one to ten occurrences. You may adjust it further.
Slight modification of @Wiktor's comment that I think is easier to read:
^[^\P{Lu}\P{Latin}]{0,10}$
This should match a string of at most 10 uppercase Latin (including extended) characters. It uses a negated class to match characters that are neither non-uppercase nor non-Latin. It does match such beautiful and definitely not cursed strings as ĦꜴꝎꞂꜨⱠƎƢƔ.
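To compare the negated-class form with the lookahead form, here is a small JavaScript sketch. It assumes the u flag and the \p{Script=Latin} spelling that JavaScript requires, and gives both patterns the same {0,10} bounds so only the character test differs:

```javascript
// Double negative: a character that is neither "not uppercase" nor "not Latin".
const negatedClass = /^[^\P{Lu}\P{Script=Latin}]{0,10}$/u;
// Lookahead form of the same character test, with matching bounds for comparison.
const lookaheadForm = /^(?:(?=\p{Lu})\p{Script=Latin}){0,10}$/u;

// "", ASCII uppercase, accented uppercase, lowercase, Greek uppercase, 11 letters.
const samples = ["", "ABC", "\u00C0\u00C9\u00CE", "abc", "\u0391\u0392\u0393", "ABCDEFGHIJK"];
const agree = samples.every((s) => negatedClass.test(s) === lookaheadForm.test(s));
const acceptsEmpty = negatedClass.test("");
```

Because of {0,10}, the empty string is accepted here, unlike with the {1,10} quantifier used earlier.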

Negating a complex regex containing three parts

I need a regex which is matched when the string doesn't have both lowercase and uppercase letters.
If the string has only lowercase letters -> should be matched
If the string has only uppercase letters -> should be matched
If the string has only digits or special characters -> should be matched
For example
abc, ABC, 123, abc123, ABC123&^ - should match
AbC, A12b, AB^%12c - should not match
Basically I need an inverse/negation of the following regex:
^(?=.*[a-z])(?=.*[A-Z]).+$
Does not sound like any lookarounds would be needed.
Either match only characters that are not a-z or only characters, that are not A-Z.
^(?:[^a-z]+|[^A-Z]+)$
See this demo at regex101 (used + for one or more)
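A quick JavaScript check of this pattern against the examples from the question. The second pattern is a literal negation of the original regex, added here only for comparison; it is not part of the answer:

```javascript
// Union of "no lowercase anywhere" and "no uppercase anywhere".
const noMixedCase = /^(?:[^a-z]+|[^A-Z]+)$/;
// Literal negation of the original pattern, shown only for comparison.
const negatedOriginal = /^(?!(?=.*[a-z])(?=.*[A-Z])).+$/;

const shouldMatch = ["abc", "ABC", "123", "abc123", "ABC123&^"];
const shouldNotMatch = ["AbC", "A12b", "AB^%12c"];

const results = {
  accepted: shouldMatch.every((s) => noMixedCase.test(s) && negatedOriginal.test(s)),
  rejected: shouldNotMatch.every((s) => !noMixedCase.test(s) && !negatedOriginal.test(s)),
};
```

The two patterns agree on these single-line inputs. They differ on strings containing line breaks, because . does not match a line break while a negated class does.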
You may use
^(?!.*[A-Z].*[a-z])(?!.*[a-z].*[A-Z])\S+$
Or
^(?=(?:[^a-z]+|[^A-Z]+)$).*$
See the regex demo #1 and regex demo #2
A lookaround solution like this can be used in more complex scenarios, when you need to apply more restrictions on the pattern. Else, consider a non-lookaround solution.
Details
^ - start of string
(?!.*[A-Z].*[a-z]) - no uppercase followed with a lowercase letter
(?!.*[a-z].*[A-Z]) - no lowercase letter followed with an uppercase one
(?=(?:[^a-z]+|[^A-Z]+)$) - a positive lookahead that requires 1 or more characters other than lowercase ASCII letters ([^a-z]+) to the end of the string, or 1 or more characters other than uppercase ASCII letters ([^A-Z]+) to the end of the string
\S+ / .* - 1+ non-whitespace chars (first pattern), or 0+ chars other than line break chars (second pattern)
$ - end of string.
You can use this regex
^(([A-Z0-9?&%^](?![a-z]))+|([a-z0-9?&%^](?![A-Z]))+)$
You can test more cases here.
I've only added the characters ?&%^ as possible special characters, but you could add whichever you like.
I would go with:
^(?:[^a-z]+?|[^A-Z]+?)$
It translates to "If the entire string is composed of non-lowercase letters or non-uppercase letters then match the string."
Lazy quantifiers +? are used so that the end-of-string $ anchor is obeyed when the multiline flag is enabled. If you're only validating a single-line string, then you can simply use + without the question mark.
If you have a whitelist of specific allowed special chars, then change [^a-z] into [A-Z0-9()_+=-] (and [^A-Z] into [a-z0-9()_+=-]) and list the allowed special chars.
https://regex101.com/r/Wg6tLn/1
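The effect of the lazy quantifiers with the multiline flag can be seen in a short JavaScript sketch. The two-line input is an assumption chosen to make the difference visible:

```javascript
const text = "abc\ndef";

// Greedy: [^A-Z]+ can run across the line break, so both lines form one match.
const greedy = text.match(/^(?:[^a-z]+|[^A-Z]+)$/gm);

// Lazy: each attempt stops at the first position where $ holds, i.e. the end of a line.
const lazy = text.match(/^(?:[^a-z]+?|[^A-Z]+?)$/gm);
```

Here the greedy form returns the whole text as one match, while the lazy form returns each line separately.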

How would I translate this to RegEx?

I'm having trouble translating this to RegEx:
Actual file format (For excel spreadsheet):
[demo-_File.xls]'SheEt_nAme'!CA
[samPle file 2.xls]'demo Sheet'!D
Inside the bracket and single quote:
Accept any characters from a to z (Regardless of case)
Accepts special characters -_. and space.
After the exclamation mark, it should accept up to 4 capital characters.
Here is my suggestion:
\[[\w\s&.-]*\]'[\w\s&.-]+'![A-Z]{1,4}
In JS:
var re = /\[[\w\s&.-]*\]'[\w\s&.-]+'![A-Z]{1,4}/gi;
[\w\s&.-]* will match all alphanumeric characters and _, along with spaces, &, . and -. The [A-Z]{1,4} will match 1 to 4 uppercase English letters. The i option will make matching case-insensitive. If you want to allow digits in the last part, just change it to [A-Z0-9]{1,4}.
See demo
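As a sketch of how the suggestion behaves as a validator, here is an anchored variant without the i and g flags. The anchors and the isSheetReference helper are assumptions for illustration: dropping i keeps the column part strictly uppercase, and dropping g avoids the lastIndex state that test() carries between calls on a global regex.

```javascript
// Anchored variant without `i`, so the column part must really be uppercase.
// Assumption: the whole string should be one reference, hence ^ and $.
const sheetRef = /^\[[\w\s&.-]*\]'[\w\s&.-]+'![A-Z]{1,4}$/;

// Hypothetical helper name, used only for illustration.
function isSheetReference(value) {
  return sheetRef.test(value);
}

const checks = [
  "[demo-_File.xls]'SheEt_nAme'!CA",
  "[samPle file 2.xls]'demo Sheet'!D",
  "[demo.xls]'Sheet'!ca",     // lowercase column
  "[demo.xls]'Sheet'!ABCDE",  // five letters
].map(isSheetReference);
```

The first two inputs are the samples from the question; the last two are made-up negative cases.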

Regex, every non-alphanumeric character except white space or colon

How can I do this one anywhere?
Basically, I am trying to match all kinds of miscellaneous characters such as ampersands, semicolons, dollar signs, etc.
[^a-zA-Z\d\s:]
\d - numeric class
\s - whitespace
a-zA-Z - matches all the letters
^ - negates them all, so you get everything that is not a letter, digit, whitespace character, or colon
This should do it:
[^a-zA-Z\d\s:]
If you want to treat accented latin characters (eg. à Ñ) as normal letters (ie. avoid matching them too), you'll also need to include the appropriate Unicode range (\u00C0-\u00FF) in your regex, so it would look like this:
/[^a-zA-Z\d\s:\u00C0-\u00FF]/g
^ negates what follows
a-zA-Z matches upper and lower case letters
\d matches digits
\s matches white space (if you only want to match spaces, replace this with a space)
: matches a colon
\u00C0-\u00FF matches the Unicode range for accented latin characters.
nb. Unicode range matching might not work for all regex engines, but the above certainly works in Javascript (as seen in this pen on Codepen).
nb2. If you're not bothered about matching underscores, you could replace a-zA-Z\d with \w, which matches letters, digits, and underscores.
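A short JavaScript sketch of the range-based pattern in use. The sample strings are made up, and the accented letters are written as escapes so the example does not depend on how the file is encoded:

```javascript
const miscChars = /[^a-zA-Z\d\s:\u00C0-\u00FF]/g;

// "Ñandú: 5 & 6; $7" with the accented letters written as escapes.
const found = "\u00D1and\u00FA: 5 & 6; $7".match(miscChars);

// Caveat: U+00C0-U+00FF also contains the signs U+00D7 (multiplication)
// and U+00F7 (division), so they are treated like letters and never matched.
const timesIsSkipped = "3 \u00D7 4".match(miscChars) === null;
```

If the multiplication and division signs should count as miscellaneous characters, the range would need to be split around U+00D7 and U+00F7.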
Try this:
[^a-zA-Z0-9 :]
JavaScript example:
"!##$%* ABC def:123".replace(/[^a-zA-Z0-9 :]/g, ".")
See an online example:
http://jsfiddle.net/vhMy8/
In JavaScript:
/[^\w_]/g
^ negation, i.e. select anything not in the following set
\w any word character (i.e. any alphanumeric character, plus underscore)
_ the underscore, listed explicitly although \w already includes it as a 'word' character, so underscores are not matched either
Usage example - const nonAlphaNumericChars = /[^\w_]/g;
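A small JavaScript sketch showing how the underscore is treated by this class, with a variant for when underscores should be reported too. The sample string is made up:

```javascript
const sample = "snake_case-name!";

// [^\w_] is the same set as \W: the underscore is already part of \w.
const withoutUnderscore = sample.match(/[^\w_]/g);
const plainNonWord = sample.match(/\W/g);

// To flag underscores as well, add them next to \W instead of inside the negation.
const withUnderscore = sample.match(/[\W_]/g);
```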
Anything that is not alphanumeric or white space, plus '_':
var reg = /[^\w\s]|[_]/g;
If you mean "non-alphanumeric characters", try to use this:
var reg =/[^a-zA-Z0-9]/g //[^abc]
This regex works for C#, PCRE and Go to name a few.
It doesn't work for JavaScript on Chrome from what RegexBuddy says. But there's already an example for that here.
The main part of this is:
\p{L}
which represents \p{L} or \p{Letter}: any kind of letter from any language.
The full regex itself: [^\w\d\s:\p{L}]
Example: https://regex101.com/r/K59PrA/2
Previous solutions only seem reasonable for English or other Latin-based languages without accents, etc. Those answers are for that reason not generalised to answer the question.
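In JavaScript engines that support Unicode property escapes (ES2018 and later), the same class can be used with the u flag. A minimal sketch with a made-up sample string:

```javascript
// `u` enables \p{...}; `g` collects every match.
const nonLetter = /[^\w\s:\p{L}]/gu;

// Cyrillic and Hangul letters are kept; only the comma and "!" are reported.
const hits = "Привет, мир: 선미!".match(nonLetter);
```

Without the u flag, JavaScript treats \p as an escaped letter p, so the flag is required for this to work.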
According to the Whitespace character article on Wikipedia, these are all the whitespace characters in Unicode:
U+0009, U+000A, U+000B, U+000C, U+000D, U+0020, U+0085, U+00A0, U+1680, U+180E, U+2000, U+2001, U+2002, U+2003, U+2004, U+2005, U+2006, U+2007, U+2008, U+2009, U+200A, U+200B, U+200C, U+200D, U+2028, U+2029, U+202F, U+205F, U+2060, U+3000, U+FEFF
So in my opinion, the most inclusive solution would be (might be slow, but this is about accuracy):
\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF
Thus, to answer OP's question to include "every non-alphanumeric character except white space or colon", prepend a hat ^ to exclude the characters above, add the colon to that, and surround the regex with [ and ] so that it means 'any of these characters':
"[^:\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF]"
Debuggex Demo
Bonus: solution for R
trimws2 <- function(..., whitespace = "[\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF]") {
  trimws(..., whitespace = whitespace)
}
This is even faster than trimws() itself which sets " \t\n\r".
microbenchmark::microbenchmark(trimws2(" \t\r\n"), trimws(" \t\r\n"))
#> Unit: microseconds
#>                   expr    min     lq     mean  median      uq     max neval cld
#>  trimws2(" \\t\\r\\n") 29.177 29.875 31.94345 30.4990 31.3895 105.642   100  a
#>   trimws(" \\t\\r\\n") 45.811 46.630 48.25076 47.2545 48.2765 116.571   100   b
Try to add this:
^[^a-zA-Z\d\s:]*$
This has worked for me... :)
[^\w\s:]
A character set that matches any character which is not:
Alphanumeric
Whitespace
Colon

How to write regular expression to match only numbers, letters and dashes?

I need an expression that will only accept:
numbers
normal letters (no special characters)
-
Spaces are not allowed either.
Example:
The regular expression should match:
this-is-quite-alright
It should not match
this -is/not,soålright
You can use:
^[A-Za-z0-9-]*$
This matches strings, possibly empty, that are wholly composed of uppercase/lowercase letters (ASCII A-Z), digits (ASCII 0-9), and dashes.
This matches (as seen on rubular.com):
this-is-quite-alright
and-a-1-and-a-2-and-3-4-5
yep---------this-is-also-okay
And rejects:
this -is/not,soålright
hello world
Explanation:
^ and $ are beginning and end of string anchors respectively
If you're looking for matches within a string, then you don't need the anchors
[...] is a character class
a-z, A-Z, 0-9 in a character class define ranges
- as a last character in a class is a literal dash
* is zero-or-more repetition
regular-expressions.info
Anchors, Character Class, Repetition
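The same pattern works unchanged in JavaScript. A minimal sketch using the examples listed above, plus the empty string to show the effect of the * quantifier:

```javascript
// Letters, digits and dashes only; the empty string also passes because of *.
const dashOnly = /^[A-Za-z0-9-]*$/;

const accepted = ["this-is-quite-alright", "and-a-1-and-a-2-and-3-4-5", ""]
  .map((s) => dashOnly.test(s));
const rejected = ["this -is/not,soålright", "hello world"]
  .map((s) => dashOnly.test(s));
```

Switching * to + would reject the empty string if that is not wanted.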
Variation
The specification was not clear, but if - is only to be used to separate "words", i.e. no double dash, no trailing dash, no preceding dash, then the pattern is more complex (only slightly!)
      _"alpha"_    separating dash
     /         \  /
^[A-Za-z0-9]+(-[A-Za-z0-9]+)*$
 \__________/| \__________/|\
    "word"   |    "word"   | zero-or-more
             \_____________/
              group together
This matches strings that consist of at least one "word", where a word consists of one or more "alpha", and an "alpha" is a letter or a digit. More "words" can follow, and they're always separated by a dash.
This matches (as seen on rubular.com):
this-is-quite-alright
and-a-1-and-a-2-and-3-4-5
And rejects:
--no-way
no-way--
no--way
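A JavaScript sketch of the stricter variation, run against the accepted and rejected examples listed above:

```javascript
// One or more alphanumeric "words", joined by single dashes.
const strictSlug = /^[A-Za-z0-9]+(-[A-Za-z0-9]+)*$/;

const inputs = [
  "this-is-quite-alright",
  "and-a-1-and-a-2-and-3-4-5",
  "--no-way",
  "no-way--",
  "no--way",
];
const verdicts = inputs.map((s) => strictSlug.test(s));
```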
[A-Za-z0-9-]+
But your question is confusing as it asks for letters and numbers and has an example containing a dash.
This is a community wiki, an attempt to compile links to related questions about this "URL/SEO slugging" topic. Community is invited to contribute.
Related questions
regex/php: how can I convert 2+ dashes to singles and remove all dashes at the beginning and end of a string?
-this--is---a-test-- becomes this-is-a-test
Regex for [a-zA-Z0-9-] with dashes allowed in between but not at the start or end
allow spam123-spam-eggs-eggs1 reject eggs1-, -spam123, spam--spam
Translate “Lorem 3 ipsum dolor sit amet” into SEO friendly “Lorem-3-ipsum-dolor-sit-amet” in Java?
Related tags
[slug]
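For the first related question (collapsing repeated dashes and trimming them from both ends), one possible JavaScript sketch follows. The linked question is about PHP, and collapseDashes is a hypothetical name, so treat this as an illustration of the idea rather than that question's accepted answer:

```javascript
// Hypothetical helper: normalise dashes in an existing slug.
function collapseDashes(slug) {
  return slug
    .replace(/-{2,}/g, "-")    // runs of dashes become a single dash
    .replace(/^-+|-+$/g, "");  // strip dashes at either end
}

const cleaned = collapseDashes("-this--is---a-test--");
```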