Purpose of this dash character in Regex capture - regex

I am trying to understand the purpose of - in this regex capture clause
(?P<slug>[\w-]+)
This is what I came up with when searching for a dash:
A dash (-) can be used to specify a range. So the dash is a
metacharacter, but only within a character class. If you want to use a
literal dash within a character class, you should escape it with a
backslash, except when the dash is the first or last character of the
character class. So, the regexp [a\-z] is equal to [az-] and [-az];
each matches any of those three characters (a, z, or -).
My question is: what is the - after \w?

You are looking at what my former CS professor would refer to as a rabbit (out of a hat):
(?P<slug>[\w-]+)
It is a rabbit because your research is normally correct: a dash inside a character class usually denotes a range of characters. But in this case the dash is a literal dash, since it appears at the end of the character class.
So here [\w-]+ means to match one or more word characters or literal dashes.
If you want to include a literal dash in a character class, a safer way is to escape it:
[\w\-]+
Then, the dash may be placed anywhere in the class.
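To see both spellings side by side, here is a minimal sketch in Python (the (?P<slug>...) named-group syntax suggests Python, but that is an assumption; the helper name slug_of and the sample strings are made up for illustration):

```python
import re

TRAILING = re.compile(r"(?P<slug>[\w-]+)")   # dash is last in the class: literal
ESCAPED = re.compile(r"(?P<slug>[\w\-]+)")   # dash is escaped: literal anywhere

def slug_of(pattern, text):
    """Return the captured slug if the whole text matches, else None."""
    m = pattern.fullmatch(text)
    return m.group("slug") if m else None

print(slug_of(TRAILING, "my-first-post"))  # my-first-post
print(slug_of(ESCAPED, "my-first-post"))   # my-first-post
print(slug_of(TRAILING, "my first post"))  # None (a space is not in the class)
```

Both patterns accept exactly the same strings; the escaped form just stays correct if someone later appends more characters to the class.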

Related

Missing simpler regex format?

So I am working on a fraction class for school and am using a regex pattern and matcher for user input. I found this online, so I'll admit I'm not exactly sure what does what, but the following pattern finds each digit and the middle operation, and allows spaces and tabs between all characters of the user's input (a fraction expression).
String fractionPattern = "\\s*\\t*\\\\*t*(-?\\d+)\\s*\\t*\\\\*t*\\/\\s*\\t*\\\\*t*(-?\\d+)\\s*\\t*\\\\*t*([-+*/])\\s*\\t*\\\\*t(\\d+)\\s*\\t*\\\\*t*\\/\\s*\\t*\\\\*t*(-?\\d+)\\s*\\t*\\\\*t*";
I've tried researching java regex metacharacters and symbol meanings, but I am sort of struggling. Can someone offer me an explanation on each character? Or possibly a simpler way of accomplishing the same thing.
So you're looking to match fractions? Like "2 / 3" or whatever, and allow spaces? If so, you want to match (and capture) some string of digits, then a '/' character, then match and capture another string of digits. Without considering spaces, this is just:
"(\\d+)/(\\d+)"
\d in a regex matches any digit (0-9), but since we specify regexes with Strings, we have to escape the backslash itself. That is, the Java string literal "\\" results in a String object with one backslash character.
The + means "one or more", and the parentheses capture it. Then the slash is just a literal slash.
To make it allow spaces, match and discard any space. \s matches any type of space (space, tab, newline), and again we have to escape the backslash:
"(\\d+)\\s*/\\s*(\\d+)"
* means "zero or more". So from left to right, this means:
One or more digits, which are captured
Zero or more spaces
A literal slash
Zero or more spaces
One or more digits, which are captured
To handle whitespace on the ends, either match space there, or
trim() the string first.
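As a rough illustration, the same pattern can be tried out in Python (the question is about Java; Python is used here only as a sketch, and parse_fraction is a hypothetical helper name). In a Python raw string the backslashes are written once, whereas the Java string literal doubles them:

```python
import re

# Same regex as "(\\d+)\\s*/\\s*(\\d+)" in a Java string literal.
FRACTION = re.compile(r"(\d+)\s*/\s*(\d+)")

def parse_fraction(text):
    """Return (numerator, denominator) as ints, or None if text is not a fraction."""
    # strip() plays the role of Java's trim() for whitespace on the ends.
    m = FRACTION.fullmatch(text.strip())
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

print(parse_fraction("2 / 3"))     # (2, 3)
print(parse_fraction("  10/4\t"))  # (10, 4)
print(parse_fraction("2 / x"))     # None
```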
The regex pattern you posted actually matches any operation between two fractions.
There is a lot of noise in there with extra \t (tab) characters (which is redundant with \s for whitespace). Removing those and changing the double backslash to single backslash for readability, we get the following:
\s*(-?\d+)\s*\/\s*(-?\d+)\s*([-+*/])\s*(\d+)\s*\/\s*(-?\d+)\s*
(Double backslash is needed when the regex is in a String, so it is not treated as an escape sequence)
Let's break it down:
\s* means match 0 or more whitespace characters
-? means match one - character or none, for negative numbers
\d+ means match 1 or more digits (0-9)
The parentheses around (-?\d+) mean we can capture this first number and refer to it later if we want to.
\s* means 0 or more whitespace characters again
\/ means match / literally. Some languages / regex parsers require the backslash in front of a forward slash. Others (like Python) do not.
\s*(-?\d+)\s* is the same thing we just did with the first number, again: get another (possibly negative) number, potentially surrounded by whitespace, and capture it with parentheses for later use.
([-+*/]) means match any of -, +, *, or /: the operation we want to perform between the two fractions. Parentheses are optional, only needed if you want to grab this character later.
\s*(\d+)\s*\/\s*(-?\d+)\s* is again what we had with the first fraction, except for some reason the possible negative sign is not included in the numerator, which is probably a mistake.
You can test it out on Regex101. (It is your friend.)
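A small sketch of how the captured groups might be used, written in Python rather than Java purely for illustration. Two things here are assumptions and not part of the original pattern: "-?" is added to the second numerator (the breakdown above suspects its absence is a mistake), and the helper evaluate with its use of fractions.Fraction is made up for this example:

```python
import operator
import re
from fractions import Fraction

# Cleaned-up pattern from above, with "-?" added to the second numerator
# (an assumption, see the lead-in).
EXPR = re.compile(
    r"\s*(-?\d+)\s*/\s*(-?\d+)\s*([-+*/])\s*(-?\d+)\s*/\s*(-?\d+)\s*"
)

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def evaluate(text):
    """Return the result as a Fraction, or None if text does not match."""
    m = EXPR.fullmatch(text)
    if m is None:
        return None
    n1, d1, op, n2, d2 = m.groups()
    # Fraction raises ZeroDivisionError for a zero denominator.
    return OPS[op](Fraction(int(n1), int(d1)), Fraction(int(n2), int(d2)))

print(evaluate("1/2 + 1/3"))       # 5/6
print(evaluate(" -3 / 4 * 2/3 "))  # -1/2
print(evaluate("1/2 plus 1/3"))    # None
```

The five groups come back in the order they appear in the pattern: first numerator, first denominator, operator, second numerator, second denominator.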

Notepad++ remove non alphanumeric characters

What is the best way to remove non alphanumeric characters from a text file using notepad++?
I only want to keep numbers and letters. Is there a built-in feature to help, or should I go the regex route?
I am trying to use this to keep them as well as spaces [a-zA-Z0-9 ]. It is working but I need to do the opposite!
In a Replace dialog window (Ctrl+H), use a negated character class in the Find What field:
[^a-zA-Z0-9\s]+
Here, [^ starts a negated character class that matches any character other than those belonging to the character set(s)/range(s) defined in it. So, the whole pattern matches 1 or more chars other than ASCII letters, digits, and any whitespace. Leave the Replace with field empty to delete the matches.
Or, to make the expression Unicode-aware,
[^[:alnum:][:space:]]+
Here, [:alnum:] matches all alphanumeric chars and [:space:] matches all whitespace.
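For anyone who wants to script the same clean-up outside Notepad++, here is a hedged sketch in Python (an assumption; the question itself only concerns Notepad++). Python's re module has no POSIX [:alnum:] classes, so the Unicode-aware variant below swaps in \w with the underscore handled separately; the function names are hypothetical:

```python
import re

# ASCII-only version: same class as in the Find What field.
NON_ALNUM_ASCII = re.compile(r"[^a-zA-Z0-9\s]+")

# Unicode-aware stand-in for [^[:alnum:][:space:]]+ :
# \w covers letters, digits and "_", so "_" is removed explicitly.
NON_ALNUM_UNICODE = re.compile(r"(?:[^\w\s]|_)+")

def strip_ascii(text):
    return NON_ALNUM_ASCII.sub("", text)

def strip_unicode(text):
    return NON_ALNUM_UNICODE.sub("", text)

print(strip_ascii("Hello, World! 123 #tag"))  # Hello World 123 tag
print(strip_unicode("caf\u00e9_bar, 42!"))    # cafébar 42
print(strip_ascii("caf\u00e9, 42!"))          # caf 42
```

The last line shows why the distinction matters: the ASCII-only class treats the accented letter as something to remove.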

Regex: match all special characters but not *

Yet another question about a regex.
I'm trying to match all special characters, except '*'.
So if I match my regex against:
John%%%* dadidou
I should get:
John* dadidou
Here: How to match with regex all special chars except "-" in PHP?
The accepted answer advises using (if I want to exclude '-'):
[^\w-]
But doesn't that mean: "NOT a special character, NOT -", which is a bit redundant ?
What you really want is this regex for matching:
[^\w\s*]+
Replace it by empty string.
Which means match 1 or more of any character that is:
Not a word character [AND]
Not a whitespace [AND]
Not a literal *
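A quick check of that pattern against the sample from the question, sketched in Python (the language is an assumption, since the question does not name one; strip_special is a hypothetical helper):

```python
import re

# One or more characters that are not a word char, not whitespace, not "*".
SPECIAL_EXCEPT_STAR = re.compile(r"[^\w\s*]+")

def strip_special(text):
    """Replace every run of matching characters with the empty string."""
    return SPECIAL_EXCEPT_STAR.sub("", text)

print(strip_special("John%%%* dadidou"))  # John* dadidou
```

Inside a character class the * needs no escaping; it is an ordinary character there.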
When you define a negative character class, you are really inverting it.
What does that mean ?
A positive character class implicitly ORs its contents.
When you negate a class, you implicitly AND its negated contents.
So, [\w-] means word OR dash,
the inverse, [^\w-] means not word AND not dash.
A negated word class on its own, for instance [^\w], would match a dash -.
So, to not match it, you have to add a not dash as well.
A C analogy would be
existing (varA || varB)
inverted (!varA && !varB)
where inverting changes the Boolean of each of the components.
Basically a negative class changes the Boolean of each of its components,
so the implicit OR becomes an implicit AND and the components characters
(or expressions) are negated.
What will really bake your noodle later on is when you see something like
[^\S\r\n]
This translates to NOT-NOT-Whitespace and NOT-cr and NOT-lf
which reduces to matching all whitespace except CR,LF
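The OR-to-AND flip and the double negative can both be checked mechanically. A minimal sketch in Python (an assumption; the sample characters and the helper name are made up for illustration):

```python
import re

# [^\w-] should agree with "not a word char AND not a dash" for every character.
for ch in "a9_- %*\t":
    in_negated_class = re.fullmatch(r"[^\w-]", ch) is not None
    not_word_and_not_dash = re.fullmatch(r"\w", ch) is None and ch != "-"
    assert in_negated_class == not_word_and_not_dash

# [^\S\r\n]: whitespace, but neither CR nor LF.
HSPACE = re.compile(r"[^\S\r\n]+")

def collapse_inline_whitespace(text):
    """Squeeze runs of spaces/tabs to one space, leaving line breaks alone."""
    return HSPACE.sub(" ", text)

print(repr(collapse_inline_whitespace("a \t b\r\n\t c")))  # 'a b\r\n c'
```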

Regex - special characters and numbers - PHP and Javascript

As I am having a hard time creating a regex that would match letters only, including accented characters (e.g. Czech characters), I would like to go the other way around for my name validation: detect special characters and numbers.
What would be regex that matches special characters and numbers?
To elaborate on @anubhava's answer: \w stands for [a-zA-Z0-9_], and capitalizing it (\W) negates the character class. If you want to match _ too, you'll have to make your own character class like [^a-zA-Z0-9] (everything but alphanumeric). This can also be shortened to [^a-z\d] if you use the i modifier. Note that this would also match accented characters, since they are not in a-zA-Z0-9.
However, I always advise against trying to use a "regular" expression to match a name (since names are not regular). See this blog post.
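The difference between \W and the hand-made class is easy to see in a small sketch. This uses Python rather than PHP or JavaScript (an assumption made only for illustration); re.ASCII is passed so that \w and \d stay limited to ASCII, as the answer describes, and the helper name flagged is hypothetical:

```python
import re

NON_WORD = re.compile(r"\W", re.ASCII)                         # [^a-zA-Z0-9_]
NON_ALNUM = re.compile(r"[^a-z\d]", re.ASCII | re.IGNORECASE)  # [^a-zA-Z0-9]

def flagged(pattern, text):
    """Return every character of text that the pattern matches."""
    return pattern.findall(text)

print(flagged(NON_WORD, "a_b!"))    # ['!']
print(flagged(NON_ALNUM, "a_b!"))   # ['_', '!']
print(flagged(NON_ALNUM, "Novak42"))  # [] (digits are not matched by this class)
print(flagged(NON_ALNUM, "Dvo\u0159\u00e1k"))  # the two accented letters
```

The last call illustrates the caveat in the answer: accented letters are flagged as if they were special characters.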

What does \'.- mean in a Regular Expression

I'm new to regular expressions and I'm having trouble finding what "\'.-" means.
'/^[A-Z \'.-]{2,20}$/i'
So far from my research, I have found that the regular expression starts (^) and requires two to twenty ({2,20}) alphabetical (A-Z) characters. The expression is also case insensitive (/i).
Any hints about what "\'.-" means?
The character class is the entire expression [A-Z \'.-], meaning any of A-Z, space, single quote, period, or hyphen. The \ is needed to protect the single quote, since it's also being used as the string quote. This charclass must be repeated 2 to 20 times, and because of the leading ^ and trailing $ anchors that must be the entire content of the matching string.
It means to escape the single quote (') that delimits the regex string (so as not to prematurely end the string), and then a . which means a literal . and a - which means a literal -.
Inside a character class, the . is treated literally, and if the - isn't part of a valid range, e.g. a-z, then it is treated literally as well.
Your regex says Match the characters a-zA-Z '.- between 2 and 20 times as the entire string, with an optional trailing \n.
This regex is in a string. The backslash is there to escape the single quote so the string doesn't end early, in the middle of the regex. The dot and dash are just what they are, a period and a dash.
So, you were nearly right, except it's 2-20 characters that are letters, space, single quote, period, or dash.
It's quoting the quote.
The regular expression is ^[A-Z '.-]{2,20}$.
In the programming language you are using, you write it as a quoted string:
'SOMETHING'
To get a single quote in there, it's been backslashed.
Everything inside the square brackets is part of the character class, and will match a single character listed. In your example, the characters listed are the letters A through Z, a space, a single quote, a period, or a hyphen. (Note the hyphen must be listed first or last, or be escaped, to avoid indicating a range, like A-Z.) Your full regular expression will match between 2 and 20 of the listed characters. The backslash before the single quote is needed so the compiler knows you are not ending the string that defines the regular expression.
Some examples of things this will match:
....................
abaca af - .
AAfa- - ..
.z
And so on.
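To try the pattern out, here is a minimal sketch in Python (an assumption; the original is a quoted regex string with / delimiters and the i flag, which re.IGNORECASE stands in for). Because the Python string below uses double quotes, the single quote needs no backslash. The helper name is_valid is made up:

```python
import re

# 2 to 20 characters, each a letter, space, single quote, period or hyphen.
NAME = re.compile(r"^[A-Z '.-]{2,20}$", re.IGNORECASE)

def is_valid(text):
    return NAME.search(text) is not None

for sample in ["." * 20, "abaca af - .", "AAfa- - ..", ".z", "O'Brien",
               "a", "x" * 21, "R2-D2", "ab\n"]:
    print(repr(sample), is_valid(sample))
```

The first five samples and the last one are accepted; "a" is too short, 21 letters is too long, and "R2-D2" contains digits. The trailing "\n" case passes because $ also matches just before a final newline.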