Regex, every non-alphanumeric character except white space or colon - regex

How can I match these anywhere in a string?
Basically, I am trying to match all kinds of miscellaneous characters such as ampersands, semicolons, dollar signs, etc.

[^a-zA-Z\d\s:]
\d - numeric class
\s - whitespace
a-zA-Z - matches all the letters
^ - negates them all, so you match anything that is not a letter, a digit, white space or a colon

This should do it:
[^a-zA-Z\d\s:]

If you want to treat accented latin characters (eg. à Ñ) as normal letters (ie. avoid matching them too), you'll also need to include the appropriate Unicode range (\u00C0-\u00FF) in your regex, so it would look like this:
/[^a-zA-Z\d\s:\u00C0-\u00FF]/g
^ negates what follows
a-zA-Z matches upper and lower case letters
\d matches digits
\s matches white space (if you only want to match spaces, replace this with a space)
: matches a colon
\u00C0-\u00FF matches the Unicode range for accented latin characters.
nb. Unicode range matching might not work for all regex engines, but the above certainly works in Javascript (as seen in this pen on Codepen).
nb2. If you're not bothered about matching underscores, you could replace a-zA-Z\d with \w, which matches letters, digits, and underscores.
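A minimal sketch of the pattern above in use. The helper name findSpecials and the sample strings are made up for illustration; only the regex comes from the answer.

```javascript
// Pattern quoted in the answer above; the constant and helper names are hypothetical.
const SPECIALS = /[^a-zA-Z\d\s:\u00C0-\u00FF]/g;

function findSpecials(text) {
  // String.prototype.match returns null when nothing matches, so fall back to [].
  return text.match(SPECIALS) || [];
}

// "Ñandú: 50% & más;" written with escapes so the accented letters are precomposed.
const sample = "\u00D1and\u00FA: 50% & m\u00E1s;";
console.log(findSpecials(sample)); // [ '%', '&', ';' ]
```

One thing to keep in mind: the range \u00C0-\u00FF also contains the multiplication sign (U+00D7) and the division sign (U+00F7), so those two symbols are treated like letters by this pattern.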

Try this:
[^a-zA-Z0-9 :]
JavaScript example:
"!##$%* ABC def:123".replace(/[^a-zA-Z0-9 :]/g, ".")
See an online example:
http://jsfiddle.net/vhMy8/

In JavaScript:
/[\W_]/g
\W any non-word character (i.e. anything that is not alphanumeric or an underscore)
_ adds the underscore back in, as it's considered a 'word' character and \W alone would skip it
Note that /[^\w_]/g does not do this: \w already includes the underscore, so listing _ again inside the negated set changes nothing.
Usage example - const nonAlphaNumericChars = /[\W_]/g;
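A quick check of how the underscore behaves in these classes, comparing /[^\w_]/g with /[\W_]/g. The variable names and the sample string are assumptions for illustration.

```javascript
// "_" is redundant in the first class: \w already covers it, so "_" is never matched.
const negatedWithUnderscore = /[^\w_]/g;
// Non-word characters plus the underscore: everything that is not a letter or digit.
const nonAlphanumeric = /[\W_]/g;

const sample = "a_b-c";
const first = sample.match(negatedWithUnderscore) || [];
const second = sample.match(nonAlphanumeric) || [];

console.log(first);  // [ '-' ]
console.log(second); // [ '_', '-' ]
```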

Matches everything that is not alphanumeric or white space, including '_'.
var reg = /[^\w\s]|_/g;

If you mean "non-alphanumeric characters", try to use this:
var reg = /[^a-zA-Z0-9]/g; // negated set, like [^abc]

This regex works for C#, PCRE and Go to name a few.
It doesn't work for JavaScript on Chrome from what RegexBuddy says. But there's already an example for that here.
The main part of this is:
\p{L}
which is short for \p{Letter}: any kind of letter from any language.
The full regex itself: [^\w\d\s:\p{L}]
Example: https://regex101.com/r/K59PrA/2
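For JavaScript specifically, engines that implement ES2018 Unicode property escapes accept \p{L} as long as the regex carries the u flag. A minimal sketch under that assumption; the helper name and the sample text are hypothetical.

```javascript
// Same class as above, with the "u" flag so that \p{L} is read as a Unicode property.
// Without "u", JavaScript treats \p as a plain "p" and the class means something else.
const NON_LETTER_SPECIALS = /[^\w\d\s:\p{L}]/gu;

function findUnicodeSpecials(text) {
  // match() returns null when there are no matches.
  return text.match(NON_LETTER_SPECIALS) || [];
}

const cyrillic = "Привет, мир: 42!";
console.log(findUnicodeSpecials(cyrillic)); // [ ',', '!' ]
```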

Previous solutions only seem reasonable for English or other Latin-based languages without accents, etc. Those answers are for that reason not generalised to answer the question.
According to the Whitespace character article on Wikipedia, these are all the whitespace characters in Unicode:
U+0009, U+000A, U+000B, U+000C, U+000D, U+0020, U+0085, U+00A0, U+1680, U+180E, U+2000, U+2001, U+2002, U+2003, U+2004, U+2005, U+2006, U+2007, U+2008, U+2009, U+200A, U+200B, U+200C, U+200D, U+2028, U+2029, U+202F, U+205F, U+2060, U+3000, U+FEFF
So in my opinion, the most inclusive solution would be (might be slow, but this is about accuracy):
\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF
Thus, to answer OP's question about "every non-alphanumeric character except white space or colon", prepend a hat ^ to exclude the characters above, add the colon, and surround everything with [ and ] so that the regex matches any single character outside that set:
"[^:\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF]"
Debuggex Demo
Bonus: solution for R
trimws2 <- function(..., whitespace = "[\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u200C\u200D\u2028\u2029\u202F\u205F\u2060\u3000\uFEFF]") {
trimws(..., whitespace = whitespace)
}
This is even faster than trimws() itself which sets " \t\n\r".
microbenchmark::microbenchmark(trimws2(" \t\r\n"), trimws(" \t\r\n"))
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> trimws2(" \\t\\r\\n") 29.177 29.875 31.94345 30.4990 31.3895 105.642 100 a
#> trimws(" \\t\\r\\n") 45.811 46.630 48.25076 47.2545 48.2765 116.571 100 b

Try to add this:
^[^a-zA-Z\d\s:]*$
This has worked for me... :)

[^\w\s:]
A character set matching any character that is not:
Alphanumeric (or an underscore, which \w also covers)
Whitespace
Colon

Related

RegEx to find any uppercase word followed by a colon

I need a RegEx to match an uppercase string ending with a colon. The string can contain spaces, numbers and periods. So that if:
mystring = "I have a C. GRAY CAT2:"
I want the coldfusion expression
REFind("[A-Z0-9. ][:]",mystring)
to return the number 9, matching "C. GRAY CAT2:". Instead, it is returning the number 21, matching only the colon. I hope that a correction of the regex will solve the problem. Of course I have tried many, many things.
Thank you!
I suggest using
[A-Z0-9][A-Z0-9. ]*:
See the regex demo
Details
[A-Z0-9] - an uppercase letter or digit (in case the first char can be a digit, else remove 0-9)
[A-Z0-9. ]* - zero or more uppercase letters/digits, . or space
: - a colon.
Variations
To avoid matching 345: like substrings but still allow 23 VAL: like ones, use
\b(?=[0-9. ]*[A-Z])[A-Z0-9][A-Z0-9. ]*:
See this regex demo. Here, \b(?=[0-9. ]*[A-Z]) matches a word boundary first, and then the positive lookahead (?=[0-9. ]*[A-Z]) makes sure there is an uppercase letter after 0+ digits, spaces or dots.
If you do not expect numbers at the start of the sequence, i.e. out of I have a 22 C. GRAY CAT2:, you need to extract C. GRAY CAT2, use Sebastian's suggestion (demo).
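The two patterns from this answer can be tried outside ColdFusion too. Below is a rough JavaScript sketch; the constant names are made up, and note that JavaScript reports a zero-based index, unlike the one-based positions REFind returns.

```javascript
// Patterns from the answer above; constant names are hypothetical.
const LABEL = /[A-Z0-9][A-Z0-9. ]*:/;
const STRICT_LABEL = /\b(?=[0-9. ]*[A-Z])[A-Z0-9][A-Z0-9. ]*:/;

const mystring = "I have a C. GRAY CAT2:";
const m = mystring.match(LABEL);
console.log(m[0]);    // C. GRAY CAT2:
console.log(m.index); // 9 (zero-based)

// The lookahead variant rejects digit-only labels but keeps mixed ones.
console.log(LABEL.test("345:"));         // true
console.log(STRICT_LABEL.test("345:"));  // false
console.log("23 VAL:".match(STRICT_LABEL)[0]); // 23 VAL:
```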
Have revised the selected answer to my own question to cover the German special characters.
[A-Z][A-Z0-9.ÜÄÖß ]*:
This appears to work, however the Germans have recently added a capital ß to their alphabet, which is surely not on most keyboards yet, and therefore will not be a problem for the RegEx for a while.

REGEX to find the first one or two capitalized words in a string

I am looking for a REGEX to find the first one or two capitalized words in a string. If the first two words are capitalized I want the first two words. A hyphen should be considered part of a word.
for Madonna has a new album I'm looking for Madonna
for Paul Young has no new album I'm looking for Paul Young
for Emmerson Lake-palmer is not here I'm looking for Emmerson Lake-palmer
I have been using ^[A-Z]+.*?\b( [A-Z]+.*?\b){0,1} which does great on the first two, but for the 3rd example I get Emmerson Lake, instead of Emmerson Lake-palmer.
What REGEX can I use to find the first one or two capitalized words in the above examples?
You may use
^[A-Z][-a-zA-Z]*(?:\s+[A-Z][-a-zA-Z]*)?
See the regex demo
Basically, use a character class [-a-zA-Z]* instead of a dot matching pattern to only match letters and a hyphen.
Details
^ - start of string
[A-Z] - an uppercase ASCII letter
[-a-zA-Z]* - zero or more ASCII letters / hyphens
(?:\s+[A-Z][-a-zA-Z]*)? - an optional (1 or 0 due to ? quantifier) sequence of:
\s+ - 1+ whitespace
[A-Z] - an uppercase ASCII letter
[-a-zA-Z]* - zero or more ASCII letters / hyphens
A Unicode aware equivalent (for the regex flavors supporting Unicode property classes):
^\p{Lu}[-\p{L}]*(?:\s+\p{Lu}[-\p{L}]*)?
where \p{L} matches any letter and \p{Lu} matches any uppercase letter.
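To see the ASCII pattern against the three examples from the question, here is a small JavaScript sketch. The helper leadingName is a hypothetical name, not part of the answer.

```javascript
// ASCII pattern from the answer above.
const LEADING_NAME = /^[A-Z][-a-zA-Z]*(?:\s+[A-Z][-a-zA-Z]*)?/;

function leadingName(text) {
  // Returns the first one or two capitalized words, or null if the string
  // does not start with an uppercase ASCII letter.
  const m = text.match(LEADING_NAME);
  return m ? m[0] : null;
}

console.log(leadingName("Madonna has a new album"));          // Madonna
console.log(leadingName("Paul Young has no new album"));      // Paul Young
console.log(leadingName("Emmerson Lake-palmer is not here")); // Emmerson Lake-palmer
```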
This is probably simpler:
^([A-Z][-A-Za-z]+)(\s[A-Z][-A-Za-z]+)?
Replace + with * if you expect single-letter words.
If you need a full name only (two words, each starting with a capital letter), this is a simple example:
^([A-Z][a-z]*)(\s)([A-Z][a-z]+)$
Try it. Enjoy!

Limitation of spaces in regex string

Total regex newbie here and I have been all over the place trying to find an answer. I need to match exactly 1 space followed by a string of letter characters (min 3, max 30). I have the following, but it will accept more than 1 space, which is the problem:
^[:blank:][A-z]{3,30}$
Any help with this would be great
[A-z] will also capture [, \, ], ^, _, `.
Use this regex to allow exactly 1 space in the beginning and then 3 to 30 English letters:
^[[:blank:]][a-zA-Z]{3,30}$
See demo.
To be unicode compatible:
^\p{Zs}\p{L}{3,30}$
Where \p{Zs} stands for a space character
and \p{L} stands for a letter.
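A short JavaScript sketch of the same idea. JavaScript has no POSIX bracket classes, so [ \t] stands in for [[:blank:]] here; that substitution and the constant names are assumptions of this example.

```javascript
// Exactly one blank (space or tab), then 3 to 30 ASCII letters.
const ONE_BLANK_THEN_LETTERS = /^[ \t][a-zA-Z]{3,30}$/;

console.log(ONE_BLANK_THEN_LETTERS.test(" abc"));  // true
console.log(ONE_BLANK_THEN_LETTERS.test("  abc")); // false: two leading spaces
console.log(ONE_BLANK_THEN_LETTERS.test(" ab"));   // false: fewer than 3 letters

// Why [A-z] is risky: the range also covers the punctuation between "Z" and "a".
const sloppyRange = /^[A-z]$/;
const strictRange = /^[a-zA-Z]$/;
console.log(sloppyRange.test("_")); // true
console.log(strictRange.test("_")); // false
```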

RegEx - 1 to 10 Alphanumeric Spaces Okay

New to Regular Expressions. Thanks in advance!
Need to validate field is 1-10 mixed-case alphanumeric and spaces are allowed. First character must be alphanumeric, not space.
Good Examples:
"Larry King"
"L King1"
"1larryking"
"L"
Bad Example:
" LarryKing"
This is what I have and it does work as long as the data is exactly 10 characters. The problem is that it does not allow less than 10 characters.
[0-9a-zA-Z][0-9a-zA-Z ][0-9a-zA-Z ][0-9a-zA-Z ][0-9a-zA-Z ][0-9a-zA-Z ][0-9a-zA-Z ][0-9a-zA-Z ][0-9a-zA-Z ][0-9a-zA-Z ]
I've read and tried many different things but am just not getting it.
Thank you,
Justin
I don't know what environment you are using and what engine. So I assume PCRE (typically for PHP)
This small regex does exactly what you want: ^(?i)(?!\s)[a-z\d ]{1,10}$
What's going on?!
the ^ marks the start of the string (delete it if the expression does not have to match the whole string)
the (?i) tells the engine to be case insensitive, so there's no need to write all letters in lower and upper case in the expression later
the (?!\s) ensures the following char won't be a white space (\s) (it's a so-called negative lookahead)
the [a-z\d ]{1,10} matches any letter (a-z), any digit (\d) and spaces ( ) in a row with min 1 and max 10 occurrences ({1,10})
the $ at the end marks the end of the string (delete it if the expression does not have to match the whole string)
Here's also a small visualization for better understanding.
Debuggex Demo
Try this: [0-9a-zA-Z][0-9a-zA-Z ]{0,9}
The {x,y} syntax means between x and y times inclusive. {x,} means at least x times.
You want something like this.
[a-zA-Z0-9][a-zA-Z0-9 ]{0,9}
The first part ensures that the first character is alphanumeric. The second part matches alphanumerics and spaces; the {0,9} allows anywhere from 0 to 9 occurrences of the second part. This gives you 1-10 characters in total.
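A minimal validation sketch in JavaScript using this pattern. The ^ and $ anchors are added here on the assumption that the whole field must match, and the function name is hypothetical.

```javascript
// First character alphanumeric, then up to 9 more alphanumerics or spaces.
const FIELD = /^[a-zA-Z0-9][a-zA-Z0-9 ]{0,9}$/;

function isValidField(value) {
  return FIELD.test(value);
}

// Examples taken from the question.
console.log(isValidField("Larry King")); // true
console.log(isValidField("L King1"));    // true
console.log(isValidField("1larryking")); // true
console.log(isValidField("L"));          // true
console.log(isValidField(" LarryKing")); // false: starts with a space
```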
Try this: ^[a-zA-Z0-9][a-z0-9A-Z ]*
An alphanumeric (so not a space) for the first character, and then zero or more alphanumeric characters or spaces. It won't cap at 10 characters but it will work for any set of 1-10 characters.
The below is probably most semantically correct:
(?=^[0-9a-zA-Z])(?=.*[0-9a-zA-Z]$)^[0-9a-zA-Z ]{1,10}$
It asserts that the first and last characters are alphanumeric and that the entire string is 1 to 10 characters in length (including spaces).
I assume that the space is not allowed at the end too.
^[a-zA-Z0-9](?:[a-zA-Z0-9 ]{0,8}[a-zA-Z0-9])?$
or with posix character classes:
^[[:alnum:]](?:[[:alnum:] ]{0,8}[[:alnum:]])?$
I think the simplest way is to go with \w[\s\w]{0,9}
Note that \w is for [A-Za-z0-9_] so replace it by [A-Za-z0-9] if you don't want _
Note that \s is for any white space char so replace it by a literal space if you don't want the others

How to write regular expression to match only numbers, letters and dashes?

I need an expression that will only accept:
numbers
normal letters (no special characters)
-
Spaces are not allowed either.
Example:
The regular expression should match:
this-is-quite-alright
It should not match
this -is/not,soålright
You can use:
^[A-Za-z0-9-]*$
This matches strings, possibly empty, that are wholly composed of uppercase/lowercase letters (ASCII A-Z), digits (ASCII 0-9), and dashes.
This matches (as seen on rubular.com):
this-is-quite-alright
and-a-1-and-a-2-and-3-4-5
yep---------this-is-also-okay
And rejects:
this -is/not,soålright
hello world
Explanation:
^ and $ are beginning and end of string anchors respectively
If you're looking for matches within a string, then you don't need the anchors
[...] is a character class
a-z, A-Z, 0-9 in a character class define ranges
- as a last character in a class is a literal dash
* is zero-or-more repetition
regular-expressions.info
Anchors, Character Class, Repetition
Variation
The specification was not clear, but if - is only to be used to separate "words", i.e. no double dash, no trailing dash, no preceding dash, then the pattern is more complex (only slightly!)
      _"alpha"_    separating dash
     /         \  /
^[A-Za-z0-9]+(-[A-Za-z0-9]+)*$
 \__________/| \__________/|\
    "word"   |    "word"   | zero-or-more
             \_____________/
              group together
This matches strings made of at least one "word", where a word consists of one or more "alpha", and an "alpha" is a letter or a digit. More "words" can follow, and they're always separated by a dash.
This matches (as seen on rubular.com):
this-is-quite-alright
and-a-1-and-a-2-and-3-4-5
And rejects:
--no-way
no-way--
no--way
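Both patterns from this answer can be compared side by side. This is only a sketch in JavaScript; the constant names are invented for the example.

```javascript
// Loose: any mix of letters, digits and dashes (including the empty string).
const LOOSE_SLUG = /^[A-Za-z0-9-]*$/;
// Strict: "words" of letters/digits separated by single dashes.
const STRICT_SLUG = /^[A-Za-z0-9]+(-[A-Za-z0-9]+)*$/;

const samples = [
  "this-is-quite-alright",
  "yep---------this-is-also-okay",
  "--no-way",
  "this -is/not,soålright",
];

for (const s of samples) {
  // Print how each sample fares under both patterns.
  console.log(s, LOOSE_SLUG.test(s), STRICT_SLUG.test(s));
}
```

The second sample shows the difference: it passes the loose pattern but fails the strict one because of the repeated dashes.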
[A-Za-z0-9-]+
But your question is confusing as it asks for letters and numbers and has an example containing a dash.
This is a community wiki, an attempt to compile links to related questions about this "URL/SEO slugging" topic. Community is invited to contribute.
Related questions
regex/php: how can I convert 2+ dashes to singles and remove all dashes at the beginning and end of a string?
-this--is---a-test-- becomes this-is-a-test
Regex for [a-zA-Z0-9-] with dashes allowed in between but not at the start or end
allow spam123-spam-eggs-eggs1 reject eggs1-, -spam123, spam--spam
Translate “Lorem 3 ipsum dolor sit amet” into SEO friendly “Lorem-3-ipsum-dolor-sit-amet” in Java?
Related tags
[slug]