Regex - special characters and numbers - PHP and Javascript - regex

I am having a hard time creating a regex that matches only letters, including accented characters (e.g. Czech characters), so I would like to go the other way around for my name validation: detect special characters and numbers.
What regex would match special characters and numbers?

To expand on @anubhava's answer: \w stands for [a-zA-Z0-9_], and capitalizing it (\W) negates the character class. If you want to match _ too, you'll have to write your own character class, such as [^a-zA-Z0-9] (everything but ASCII alphanumerics). This can be shortened to [^a-z\d] if you use the i modifier. Note that this also matches accented characters, since they are not in a-zA-Z0-9.
However, I always advise against trying to use a "regular" expression to match a name (since names are not regular). See this blog post.
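As a rough illustration of the difference, here is a minimal JavaScript sketch. It assumes an engine with the u flag and Unicode property escapes (ES2018). The helper name hasSpecialOrDigit and the choice to allow spaces, apostrophes and hyphens are assumptions made for the example, not part of the answer above.

```javascript
// ASCII-only class: flags anything that is not a-z or 0-9 (case-insensitive).
const asciiOnly = /[^a-z\d]/i;

// Assumed alternative: flag anything that is not a Unicode letter, a space,
// an apostrophe or a hyphen. Requires the `u` flag.
const unicodeAware = /[^\p{L} '-]/u;

// Hypothetical helper name, chosen for this sketch.
function hasSpecialOrDigit(name) {
  return unicodeAware.test(name);
}

const czechName = "Dvo\u0159\u00E1k"; // "Dvorak" with a caron on the r and an acute on the a

console.log(asciiOnly.test(czechName));     // true: the accented letters are outside a-z
console.log(hasSpecialOrDigit(czechName));  // false: every character is a letter
console.log(hasSpecialOrDigit("R2-D2"));    // true: contains digits
```

The second pattern still rejects some real names, which is the point of the warning about validating names with a regex.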

Related

Purpose of this dash character in Regex capture

I am trying to understand the purpose of - in this regex capture clause
(?P<slug>[\w-]+)
This is what I found when I searched for the dash:
A dash (-) can be used to specify a range. So the dash is a
metacharacter, but only within a character class. If you want to use a
literal dash within a character class, you should escape it with a
backslash, except when the dash is the first or last character of the
character class. So, the regexp [a\-z] is equal to [az-] and [-az];
they will match any of those three characters.
My question is: what is the - after \w doing?
You are looking at what my former CS professor would refer to as a rabbit (out of a hat):
(?P<slug>[\w-]+)
The reason it is a rabbit is because normally your research is correct and dash is used as a part of a range of characters. But in this case, the dash is a literal dash, since it appears at the end of the character class.
So here [\w-]+ means to match one or more word characters or literal dashes.
If you want to include a literal dash in a character class, a safer way is to escape it:
[\w\-]+
Then, the dash may be placed anywhere in the class.
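A small JavaScript sketch of the same behaviour follows. Note that (?P<slug>...) is the Python/PCRE spelling of a named group; JavaScript writes it as (?<slug>...), so the pattern is adapted here. The variable names and the sample string are made up for the example.

```javascript
// Dash at the end of the class: taken literally.
const trailingDash = /^(?<slug>[\w-]+)$/;
// Escaped dash: also literal, and safe in any position.
const escapedDash = /^(?<slug>[\w\-]+)$/;

const sample = "my-first_post-2";
console.log(trailingDash.exec(sample).groups.slug); // "my-first_post-2"
console.log(escapedDash.test(sample));              // true
console.log(trailingDash.test("no spaces"));        // false: a space is not in the class

// Between two characters the dash forms a range instead.
console.log(/[a-z]/.test("m")); // true: "m" lies in the range a to z
console.log(/[az-]/.test("m")); // false: only "a", "z" or "-" match
```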

regex for hashtags

I have found a lot of regex examples to retrieve hashtags from a text. Unfortunately, none of examples is what I need.
This is almost what I need but...
function hashtags(text) {
  return text.replace(/(^|\s)#(\w*[a-zA-Z]+\w{2,50})/g,
    "$1<a href='/h/$2' target='_blank'>#$2</a>");
}
Hashtags cannot start with a number, to avoid situations where, for example, Section #12 gets hashtagged.
The example above checks for that, but it does not allow characters like ÁÉÍÚ, it does not check the hashtag length correctly, and it does not allow the '-' character.
So, I need the following:
1. A hashtag may start with any letter (A, z, B, Ñ, ó, Ú etc.), but not with a number and not with a special sign such as & % $ - or _
2. The total length of a hashtag must be 3-50 characters. The regex must accept only full words as hashtags and must not cut them off after the first 50 characters. So words that start with # but contain more than 50 characters must be ignored, rather than having their first 50 characters converted into a hashtag link. In my example, {2,50} does not work correctly.
3. The rest of a hashtag (once it is established that it does not start with a number or a special sign) may contain numbers, any letters, and the _ and - signs. \w allows _ but not -
Is it possible?
For 1 - you need a character class. You can define these with square brackets. PCRE defines \w but that includes numbers too.
For 2 - You can either have a word followed by 'some whitespace' (PCRE: \s) or use a lookaround pattern such as (?![A-Z0-9]), meaning 'not followed by this'.
And for 3 - non-whitespace may be what you want - \S in PCRE definitions.
/(?<!\w)#[A-Z]\S{1,49}(?!\w)/i
Edit: Given that this may be JavaScript-specific, and JavaScript engines older than ES2018 do not support lookbehind, the above may not work for you. If you are tying your regex query to a particular language, it is useful to specify that constraint in the question.
Try this one:
/(^|\s)#([^\d&%$_-]\S{2,49})\b/g
Explaining:
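(^|\s)         # start of the string or a whitespace character
#([^\d&%$_-]   # a literal #, then a first character that is none of the ones you mentioned
\S{2,49})      # 2 to 49 more non-whitespace characters (the first character was already matched)
\b             # a word boundary, so a word longer than 50 characters is not cut short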
Hope it helps.
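For comparison, here is a hedged sketch of how the three requirements could be expressed with Unicode property escapes. It is not taken from either answer. It assumes an ES2018 engine (u flag, \p{L}, \p{N}) with String.prototype.matchAll, and it assumes the 3-50 length is counted without the leading #. The names HASHTAG and findHashtags are hypothetical.

```javascript
// (^|\s)                   start of text or whitespace before the tag
// \p{L}                    first character: any Unicode letter
// [\p{L}\p{N}_-]{2,49}     then 2 to 49 letters, digits, "_" or "-"
// (?![\p{L}\p{N}_-])       and no further tag character, so long words are skipped
const HASHTAG = /(^|\s)#(\p{L}[\p{L}\p{N}_-]{2,49})(?![\p{L}\p{N}_-])/gu;

// Hypothetical helper: returns the tag texts without the leading "#".
function findHashtags(text) {
  return Array.from(text.matchAll(HASHTAG), (m) => m[2]);
}

console.log(findHashtags("Section #12 mentions #release-notes and #ab"));
// [ 'release-notes' ]
```

The same pattern can be passed to text.replace with the "$1<a href='/h/$2' ...>" replacement string from the question, since group 1 and group 2 keep the same roles.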

What is the regular expression to allow uppercase/lowercase (alphabetical characters), periods, spaces and dashes only?

I am having problems creating a regex validator that checks to make sure the input has uppercase or lowercase alphabetical characters, spaces, periods, underscores, and dashes only. Couldn't find this example online via searches. For example:
These are ok:
Dr. Marshall
sam smith
.george con-stanza .great
peter.
josh_stinson
smith _.gorne
Anything containing other characters is not okay. That is numbers, or any other symbols.
The regex you're looking for is ^[A-Za-z.\s_-]+$
^ asserts that the regular expression must match at the beginning of the subject
[] is a character class - any character that matches inside this expression is allowed
A-Z allows a range of uppercase characters
a-z allows a range of lowercase characters
. matches a literal period; inside a character class it loses its usual "any character" meaning
\s matches whitespace (spaces and tabs, but also line breaks)
_ matches an underscore
- matches a dash (hyphen); we have it as the last character in the character class so it doesn't get interpreted as being part of a character range. We could also escape it (\-) instead and put it anywhere in the character class, but that's less clear
+ asserts that the preceding expression (in our case, the character class) must match one or more times
$ Finally, this asserts that we're now at the end of the subject
When you're testing regular expressions, you'll likely find a tool like regexpal helpful. This allows you to see your regular expression match (or fail to match) your sample data in real time as you write it.
Check out the basics of regular expressions in a tutorial. All it requires is two anchors and a repeated character class:
^[a-zA-Z ._-]*$
If you use the case-insensitive modifier, you can shorten this to
^[a-z ._-]*$
Note that the space is significant (it is just a character like any other).
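The patterns from both answers can be checked against the sample inputs with a short JavaScript sketch. The variable names and the rejected examples are made up for illustration.

```javascript
// Pattern from the first answer: letters, periods, whitespace, underscores, dashes.
const strictName = /^[A-Za-z.\s_-]+$/;
// Variant from the second answer: * instead of +, with the i modifier.
const lenientName = /^[a-z ._-]*$/i;

const accepted = ["Dr. Marshall", "sam smith", ".george con-stanza .great",
                  "peter.", "josh_stinson", "smith _.gorne"];
const rejected = ["agent 007", "a+b", "50%"];

console.log(accepted.every((s) => strictName.test(s))); // true
console.log(rejected.some((s) => strictName.test(s)));  // false
console.log(strictName.test(""), lenientName.test("")); // false true
```

The last line shows one practical difference: the * quantifier also accepts an empty input, while + requires at least one character.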

Regex to match all of a set except certain ones

I'm sure this has been asked before, but I can't seem to find it (or know the proper wording to search for)
Basically I want a regex that matches all non-alphanumeric characters except hyphens. In other words, match \W+ but exclude '-'. I'm not sure how to exclude specific characters from a predefined set.
\W is a shorthand for [^\w]. So:
[^\w-]+
A bit of background:
[…] defines a set
[^…] negates a set
Generally, every \v (lowercase) set is negated by a \V (uppercase), where v is any letter that defines a set.
For international characters, you may want to look into the POSIX classes [[:alpha:]] and [[:alnum:]], where your regex flavor supports them.
[^\w-]+
will do just that: it matches any character that is neither in the \w set nor a hyphen.
You can use:
[^a-zA-Z0-9_-]
or
[^\w-]
to match a single non-hyphen, non-alphanumeric character. To match one or more of them, append a +
In Java 7 or above, prepend (?U) so that \w covers Unicode word characters instead of only ASCII ones, e.g.
(?U)[^\w-]
In a Java string (you need to escape \ character with another one):
(?U)[^\\w-]
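A brief JavaScript sketch of the negated class in use. The function name collapseSymbols is hypothetical, and replacing each run with a single space is just one possible use.

```javascript
// Replace every run of characters that are neither word characters
// (a-z, A-Z, 0-9, _) nor hyphens with a single space.
function collapseSymbols(input) {
  return input.replace(/[^\w-]+/g, " ");
}

console.log(collapseSymbols("foo_bar-baz!@# qux")); // "foo_bar-baz qux"
console.log(collapseSymbols("a--b...c"));           // "a--b c"
```

In JavaScript, \w is ASCII-only even with the u flag, so accented letters count as non-word characters here; a class built on \p{L} and \p{N} would be needed to keep them.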

Regex for alphanumeric, but at least one letter

In my ASP.NET page, I have an input box that has to have the following validation on it:
Must be alphanumeric, with at least one letter (i.e. can't be ALL
numbers).
^\d*[a-zA-Z][a-zA-Z0-9]*$
Basically this means:
Zero or more ASCII digits;
One alphabetic ASCII character;
Zero or more alphanumeric ASCII characters.
Try a few tests and you'll see that this passes any alphanumeric ASCII string that contains at least one letter.
The key to this is the \d* at the front. Without it the regex gets much more awkward to do.
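A quick way to run "a few tests" is a short JavaScript sketch like the one below. The variable name and the sample inputs are made up for the example.

```javascript
// Zero or more digits, one letter, then any alphanumerics.
const atLeastOneLetter = /^\d*[a-zA-Z][a-zA-Z0-9]*$/;

for (const input of ["123", "123a", "a", "12a34", "abc!", ""]) {
  console.log(JSON.stringify(input), atLeastOneLetter.test(input));
}
// "123" false, "123a" true, "a" true, "12a34" true, "abc!" false, "" false
```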
Most answers to this question are correct, but there's an alternative, that (in some cases) offers more flexibility if you want to change the rules later on:
^(?=.*[a-zA-Z].*)([a-zA-Z0-9]+)$
This will match any sequence of alphanumeric characters, but only if the lookahead (?=.*[a-zA-Z].*) also succeeds, that is, only if the string contains at least one letter. The lookahead checks a condition without consuming any characters, a little-known trick in regular expressions that allows you to handle some very difficult validation problems.
For example, say you need to add another constraint: the string should be between 6 and 12 characters long. The obvious solutions posted here wouldn't work, but using the look-ahead trick, the regex simply becomes:
^(?=.*[a-zA-Z].*)([a-zA-Z0-9]{6,12})$
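The length-constrained variant can be exercised in JavaScript, which supports lookahead. This is only a sketch; the variable name and the inputs are invented for the example.

```javascript
// Lookahead: somewhere there is a letter. Body: 6 to 12 alphanumerics.
const sixToTwelve = /^(?=.*[a-zA-Z].*)([a-zA-Z0-9]{6,12})$/;

console.log(sixToTwelve.test("abc123"));        // true
console.log(sixToTwelve.test("12345a"));        // true: the letter may come last
console.log(sixToTwelve.test("123456"));        // false: no letter
console.log(sixToTwelve.test("abc12"));         // false: only 5 characters
console.log(sixToTwelve.test("abcdefghijklm")); // false: 13 characters
```

The trailing .* inside the lookahead is redundant (a lookahead does not need to reach the end of the string), but it is kept here to match the pattern as posted.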
^[\p{L}\p{N}]*\p{L}[\p{L}\p{N}]*$
Explanation:
[\p{L}\p{N}]* matches zero or more Unicode letters or numbers
\p{L} matches one letter
[\p{L}\p{N}]* matches zero or more Unicode letters or numbers
^ and $ anchor the string, ensuring the regex matches the entire string. You may be able to omit these, depending on which regex matching function you call.
Result: you can have any alphanumeric string except there's got to be a letter in there somewhere.
\p{L} is similar to [A-Za-z] except it will include all letters from all alphabets, with or without accents and diacritical marks. It is much more inclusive, using a larger set of Unicode characters. If you don't want that flexibility substitute [A-Za-z]. A similar remark applies to \p{N} which could be replaced by [0-9] if you want to keep it simple. See the MSDN page on character classes for more information.
The less fancy non-Unicode version would be
^[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*$
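The answer above targets .NET, but the same Unicode-aware pattern can be tried in JavaScript, assuming an ES2018 engine where \p{L} and \p{N} are available under the u flag. The variable names are made up for this sketch.

```javascript
// Unicode version: any letters or numbers, with at least one letter.
const unicodeAlnum = /^[\p{L}\p{N}]*\p{L}[\p{L}\p{N}]*$/u;
// ASCII version for comparison.
const asciiAlnum = /^[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*$/;

const word = "caf\u00E92"; // "cafe2" with an acute accent on the e

console.log(unicodeAlnum.test(word));    // true
console.log(asciiAlnum.test(word));      // false: the accented letter is outside A-Za-z
console.log(unicodeAlnum.test("2024"));  // false: no letter at all
console.log(unicodeAlnum.test("abc_1")); // false: underscore is neither letter nor number
```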
^[0-9]*[A-Za-z][0-9A-Za-z]*$
is the regex that will do what you're after. The ^ and $ match the start and end of the input to prevent other characters. You could replace the [0-9A-Za-z] block with \w (which additionally allows the underscore), but I prefer the more verbose form because it's easier to extend with other characters if you want.
Add a regular expression validator to your ASP.NET page as per the tutorial on MSDN: http://msdn.microsoft.com/en-us/library/ms998267.aspx.
^\w*[\p{L}]\w*$
This one's not that hard. The regular expression reads: match a line starting with any number of word characters (letters, digits, and connector punctuation such as the underscore, which you might not want), that contains one letter character (that's the [\p{L}] part in the middle), followed by any number of word characters again.
If you want to exclude that punctuation, you'll need a heftier expression:
^[\p{L}\p{N}]*[\p{L}][\p{L}\p{N}]*$
And if you don't care about Unicode you can use a boring expression:
^[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*$
^[0-9]*[a-zA-Z][a-zA-Z0-9]*$
This can match:
a number followed by a single letter,
or an alphanumeric string that starts with a letter,
or an alphanumeric string that starts with digits, followed by a letter and then any alphanumeric characters