Regular expressions \w without underscore in a character set [] - regex

This question is identical to Regular Expressions: How to Express \w Without Underscore, except that the goal is to match characters in the letter (L) general category plus a specified set of additional characters.
For example, [-$\a\d]+ would match identifiers like $gâteau-Noël-19 but not $gâteau_Noël-19, if a hypothetical \a letter class existed. But for some bizarre and incomprehensible reason it does not.
So the clumsy substitute suggested in the previous question, [^\W_], works fine as a replacement for \a by itself. But how can it be combined with additional characters to form the above regular expression?
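One possible way to combine them, sketched here in Python (an assumption, since the question names no regex flavor): alternate [^\W_] with an ordinary set holding the extra characters, or allow \w plus the extras and veto the underscore with a negative lookahead. The names IDENT and IDENT_ALT are hypothetical. Note that [^\W_] also covers digits, so no separate \d is needed.

```python
import re

# [^\W_] matches any word character except the underscore (letters and
# digits; Unicode-aware for str patterns in Python 3). Alternating it with
# a plain set adds the extra characters "-" and "$".
IDENT = re.compile(r"(?:[^\W_]|[-$])+")

# Same idea written differently: accept \w plus the extras, but reject "_"
# at each position with a negative lookahead.
IDENT_ALT = re.compile(r"(?:(?!_)[\w$-])+")

for candidate in ("$gâteau-Noël-19", "$gâteau_Noël-19"):
    print(candidate,
          bool(IDENT.fullmatch(candidate)),
          bool(IDENT_ALT.fullmatch(candidate)))
```

Both patterns accept the first identifier in full and reject the second because of the underscore.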

Related

Regex - Why don't these two expressions produce the same result? [duplicate]

I'm currently using this website to create some regular expressions for a programming language I want to build. At the moment I'm just setting up an expression for identifiers.
In my language, identifiers are expressed like most languages:
They cannot begin with a digit, or special character other than an underscore
After the first character they can contain alphanumeric and underscore characters
Given those rules I've come up with the following expression by myself:
^\D\w+$
Obviously, it doesn't account for special characters, however the following expression does (which I didn't make myself):
^(?!\d)\w+$
Why does the second expression account for special characters? Shouldn't they produce the same results?
I will explain why the second regex works.
The second regex uses a lookahead. After matching the start of the string, the engine checks whether the next character is a digit, but it does not consume it. This is important: if the next character is not a digit, the engine then tries to match that same character with \w, which fails if the character is a symbol. If it is a digit, the negative lookahead fails and nothing is matched.
\D, on the other hand, consumes the first character as long as it is not a digit, and \w+ then matches the word characters that follow. That means any symbol is accepted as the first character.
This ^(?!\d)\w+$ means a string consisting of word characters [a-zA-Z0-9_] that doesn't start with a digit.
This ^\D\w+$ means a non-digit character followed by at least one character from the [a-zA-Z0-9_] set.
So #ab01 is matched by ^\D\w+$, while ^(?!\d)\w+$ rejects it.
(?!\d)\w+ means "match a run of word characters that does not start with a digit". Wrapped in ^ and $, it behaves like ^\w+$ with the extra condition that the first character is not a digit, which is not the same as ^\D\w+$. ^(?!\d).\w+$ (note the "." after the lookahead) would behave the same as ^\D\w+$ for single-line input.
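The difference between the two expressions can be checked with a short script. This is a minimal sketch in Python (the question does not say which engine the website uses); the variable names and sample strings are made up for illustration.

```python
import re

# \D consumes the first character: anything that is not a digit, symbols included.
consuming = re.compile(r"^\D\w+$")
# (?!\d) only peeks; the first character must still be matched by \w.
lookahead = re.compile(r"^(?!\d)\w+$")

for s in ("_name", "name1", "1name", "#ab01", "x"):
    print(f"{s!r:8} consuming={bool(consuming.match(s))} "
          f"lookahead={bool(lookahead.match(s))}")
```

Besides "#ab01", the single-character identifier "x" also separates the two: ^\D\w+$ needs at least two characters, while ^(?!\d)\w+$ accepts one.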

Regex to match other than listed string

I need to match any value that is not in the following list and that contains no special characters.
Strings and characters that need to be rejected:
XNIL
SNIL
All special characters
My expression is: (?!XNIL|SNIL|[\W])\w+
The problem is that if my text contains the word XNIL or SNIL, the regex still matches the NIL part, even though I have listed XNIL and SNIL to be rejected. What mistake did I make here?
You can check my regex online here -> http://regexr.com/3cdsl
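This seems to work on your test page: (?!(XNIL|SNIL|\W+))\b\w+. At least it solves the XNIL/SNIL problem.
Your regex was still matching inside XNIL because of the \w+: once the lookahead rejects the position before X, the engine retries one character later, where the lookahead passes and \w+ matches NIL. The \b prevents a match from starting in the middle of a word. To see the effect, take your original and change \w+ to \w and notice the difference.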
UPDATE:
Based on your feedback, you also wish to exclude _.
Because _ is used in programming language symbols, and [arguably] regexes were created of, by, and for programmers, _ is considered a "word" char (i.e. it's in \w and therefore not matched by \W).
From the [perl] regex man page:
\w Match a "word" character (alphanumeric plus "_", plus other connector punctuation chars plus Unicode marks)
Your final regex might need to be: (?!(XNIL|SNIL|_+|\W+))\b\w+. (Note: the _+)
A cleaner way: (?!(XNIL|SNIL|[\W_]+))\b\w+ which produces the same results yet is closer in intent to what you wanted.
You may have to adjust \w+ accordingly as well, since it still matches _.
If you really want to be sure, at the expense of being slightly more verbose, write out the character class as you choose:
(?!(XNIL|SNIL|[^a-zA-Z0-9]+))\b[a-zA-Z0-9]+
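The effect of the added \b can be reproduced outside the test page. The following is a sketch in Python (an assumption; regexr itself runs JavaScript, but these particular patterns use no flavor-specific syntax), with a made-up sample string and a hypothetical helper named matches.

```python
import re

original = re.compile(r"(?!XNIL|SNIL|[\W])\w+")
fixed = re.compile(r"(?!(XNIL|SNIL|[\W_]+))\b\w+")

def matches(pattern, text):
    # finditer + group(0) returns the whole match even though the
    # fixed pattern contains a capturing group inside the lookahead.
    return [m.group(0) for m in pattern.finditer(text)]

text = "XNIL SNIL NIL foo"
print(matches(original, text))  # ['NIL', 'NIL', 'NIL', 'foo']
print(matches(fixed, text))     # ['NIL', 'foo']
```

The original pattern picks up the NIL tail of both rejected words; with \b a match can only start at a word boundary, so only the standalone NIL and foo remain.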
Check this regex
[^(XNIL|SNIL|[^\w])]
Explanation
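[] with ^ at the beginning says that anything that is not in the list given in [] should be matched.
(XNIL|SNIL|[^\w]) is meant to match the words XNIL or SNIL, with [^\w] matching anything other than word characters (i.e. special chars).
So the whole regex is meant to match anything that is not in (XNIL|SNIL|[^\w]). Be aware, though, that a character set only ever matches a single character, and that (, | and ) are literal characters inside [], so this pattern cannot reject XNIL or SNIL as whole words.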
This should work
(?m)^(((?!XNIL|SNIL|[\W]).)*)$
Grouping the character match with the negative lookahead will cause the zero length assertion to continue until finished (in this case at the end of the string due to $)

Regex - special characters and numbers - PHP and Javascript

As I have a hard time creating a regex that would match only letters, including accented characters (i.e. Czech characters), I would like to go the other way around for my name validation: detect special characters and numbers.
What would be regex that matches special characters and numbers?
To expand on @anubhava's answer, \w stands for [a-zA-Z0-9_] and capitalizing it negates the character class. If you want to match _ too, you'll have to make your own character class like [^a-zA-Z0-9] (everything but alphanumeric). Also, this can be shortened to [^a-z\d] if you use the i modifier. Note that this would also match accented characters, since they are not in a-zA-Z0-9.
However, I always advise against trying to use a "regular" expression to match a name (since names are not regular). See this blog post.
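The caveat about accented characters is easy to see in practice. The sketch below uses Python rather than PHP or JavaScript (an assumption made only for illustration); in Python 3, \w is Unicode-aware for str patterns, so a class such as [\W\d_] is one possible way to flag anything that is not a letter. The pattern names and sample names are hypothetical.

```python
import re

# ASCII-only class from the answer: everything but a-z, A-Z and 0-9.
ascii_only = re.compile(r"[^a-z\d]", re.IGNORECASE)

# Hypothetical Unicode-aware alternative: non-word characters, digits
# and the underscore, i.e. anything that is not a letter.
unicode_aware = re.compile(r"[\W\d_]")

for name in ("Novak", "Dvořák", "R2D2", "O'Neil"):
    print(name,
          bool(ascii_only.search(name)),
          bool(unicode_aware.search(name)))
```

The ASCII class flags "Dvořák" because of its accented letters and lets the digits in "R2D2" through, while the Unicode-aware class does the opposite. Both flag the apostrophe in "O'Neil", which illustrates why validating names with a regex is fragile.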

What is the regular expression to allow uppercase/lowercase (alphabetical characters), periods, spaces and dashes only?

I am having problems creating a regex validator that checks to make sure the input has uppercase or lowercase alphabetical characters, spaces, periods, underscores, and dashes only. Couldn't find this example online via searches. For example:
These are ok:
Dr. Marshall
sam smith
.george con-stanza .great
peter.
josh_stinson
smith _.gorne
Anything containing other characters, that is, numbers or any other symbols, is not okay.
The regex you're looking for is ^[A-Za-z.\s_-]+$
^ asserts that the regular expression must match at the beginning of the subject
[] is a character class - any character that matches inside this expression is allowed
A-Z allows a range of uppercase characters
a-z allows a range of lowercase characters
. matches a literal period rather than any character (inside a character class the dot has no special meaning)
\s matches whitespace (spaces, tabs, and line breaks)
_ matches an underscore
- matches a dash (hyphen); we have it as the last character in the character class so it doesn't get interpreted as being part of a character range. We could also escape it (\-) instead and put it anywhere in the character class, but that's less clear
+ asserts that the preceding expression (in our case, the character class) must match one or more times
$ Finally, this asserts that we're now at the end of the subject
When you're testing regular expressions, you'll likely find a tool like regexpal helpful. This allows you to see your regular expression match (or fail to match) your sample data in real time as you write it.
Check out the basics of regular expressions in a tutorial. All it requires is two anchors and a repeated character class:
^[a-zA-Z ._-]*$
If you use the case-insensitive modifier, you can shorten this to
^[a-z ._-]*$
Note that the space is significant (it is just a character like any other).
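As a quick check, the first pattern can be run against the sample names from the question. This is a minimal sketch in Python (the question does not name a language); the helper is_valid and the rejected samples are made up for illustration.

```python
import re

# Letters, periods, whitespace, underscores and dashes only, one or more.
VALID = re.compile(r"^[A-Za-z.\s_-]+$")

def is_valid(text):
    return VALID.match(text) is not None

accepted = ["Dr. Marshall", "sam smith", ".george con-stanza .great",
            "peter.", "josh_stinson", "smith _.gorne"]
rejected = ["agent 007", "sam@smith", ""]

for text in accepted + rejected:
    print(repr(text), is_valid(text))
```

The empty string is rejected here because of the + quantifier; the * variant shown above would accept it.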

Is "#" a special character in regular expressions?

I am working on an email filter and I have come across a list of regular expressions that are used to block all emails coming from senders that match a record in that list. While browsing through the list, I have discovered that all occurrences of the # character are escaped with a \.
Does the # mean anything special in regular expressions, and does it need to be escaped like so: \#?
It's normally not a special character, but it doesn't hurt to escape it, which is probably why many people do it: they just want to be safe (or they think it's a special character). The exception is free-spacing (x) mode, where an unescaped # starts a comment.
No, the # is not a special character in regex.
The \ can also be used in the following construct:
Pattern:
\Q...\E
Def
Matches the characters between \Q and \E literally, suppressing the meaning of special characters.
Example:
\Q+-/\E matches +-/
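One situation where escaping # does matter is free-spacing (verbose) mode, in which an unescaped # starts a comment. The sketch below uses Python as an assumed engine (the email filter's engine is not stated). Python's re module does not support \Q...\E, so re.escape is used here as a rough stand-in for quoting a literal string.

```python
import re

# By default "#" is an ordinary character, escaped or not.
plain = re.fullmatch(r"a#b", "a#b")
escaped = re.fullmatch(r"a\#b", "a#b")

# In verbose mode an unescaped "#" starts a comment, so the pattern
# below is effectively just "a".
verbose_plain = re.fullmatch(r"a#b", "a#b", re.VERBOSE)
verbose_escaped = re.fullmatch(r"a\#b", "a#b", re.VERBOSE)

# re.escape backslash-escapes whatever could be special, similar in
# spirit to wrapping the text in \Q...\E.
literal = re.fullmatch(re.escape("+-/"), "+-/")

print(bool(plain), bool(escaped))                  # True True
print(bool(verbose_plain), bool(verbose_escaped))  # False True
print(bool(literal))                               # True
```

Escaping every # is therefore harmless in normal mode and keeps the pattern working if it is later compiled in verbose mode.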