What is the meaning of [\w\-] regular expression in PHP - regex

I was trying to understand email validation from the following link:
http://www.w3schools.com/PHP/php_form_url_email.asp
I know that \w means alphanumeric characters, i.e. [0-9a-zA-Z], and that - should mean a "-" is included as well. I got confused because they have used it after the "." as well; I thought that only alphanumeric characters such as "com", "org", etc. can appear after the ".".

Regex 101
\w explained
\w match any word character [a-zA-Z0-9_]
\w\- explained
\w\-
\w match any word character [a-zA-Z0-9_]
\- matches the character - literally
Matching email addresses (simple, not future-proof)
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b
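As a rough illustration of the simple email pattern above, here is a minimal Python sketch. The helper name find_emails and the sample addresses are made up for the example, and the case-insensitive flag is an assumption needed because the pattern only lists upper-case ranges.

```python
import re

# The pattern lists only A-Z, so it relies on case-insensitive matching.
EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b", re.IGNORECASE)

def find_emails(text):
    """Return every substring of text that the simple pattern accepts."""
    return EMAIL.findall(text)

# Hyphens are allowed in the domain part, which is why "-" shows up there too.
print(find_emails("Write to john.doe@example.com or a-b@my-host.org today."))
# → ['john.doe@example.com', 'a-b@my-host.org']
```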

\w means [a-zA-Z0-9_]
and
\- means - (literal) in a character class.
Thus [\w\-] means [a-zA-Z0-9_-]
Note that escaping - in a character class is unnecessary if it is at the first or last position.
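The equivalence can be checked mechanically. The sketch below uses Python rather than PHP, with re.ASCII so that \w is limited to [a-zA-Z0-9_] (by default Python's \w also matches non-ASCII letters). The function name same_on_ascii is made up for the example.

```python
import re

# re.ASCII restricts \w to [a-zA-Z0-9_].
escaped = re.compile(r"[\w\-]", re.ASCII)    # hyphen escaped
unescaped = re.compile(r"[\w-]", re.ASCII)   # hyphen last, so no escape needed
explicit = re.compile(r"[a-zA-Z0-9_-]")      # the class spelled out

def same_on_ascii():
    """Check that all three classes accept exactly the same ASCII characters."""
    for code in range(128):
        ch = chr(code)
        results = {bool(p.fullmatch(ch)) for p in (escaped, unescaped, explicit)}
        if len(results) != 1:
            return False
    return True

print(same_on_ascii())  # → True
```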

Related

Why is the special character not captured in the regex group

I have the following regular expression for capturing positive & negative time offsets.
\b(?<sign>[\-\+]?)(?<hours>2[1-3]|[01][0-9]|[1-9]):(?<minutes>[0-5]\d)\b
It matches fine but the leading sign doesn't appear in the capture group. Am I formatting it wrong?
You can see the effect here https://regex101.com/r/CQxL8q/1/
That is because of the first \b. The \b word boundary does not match between the start of the string (or of a line) and a - or + (i.e. a non-word char).
You need to move the word boundary after the optional sign group:
(?<sign>[-+]?)\b(?<hours>2[1-3]|[01][0-9]|[1-9]):(?<minutes>[0-5][0-9])\b
              ^^
See the regex demo.
Now, since the char following the word boundary is a digit (a word char) the word boundary will work correctly failing all matches where the digit is preceded with another word char.
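The effect of moving the word boundary can be reproduced with a short Python sketch. Python spells named groups as (?P<name>...) instead of (?<name>...), so the patterns are adapted accordingly; the helper sign_of is a made-up name.

```python
import re

# Original: \b comes before the optional sign.
original = re.compile(
    r"\b(?P<sign>[-+]?)(?P<hours>2[1-3]|[01][0-9]|[1-9]):(?P<minutes>[0-5]\d)\b"
)
# Fixed: \b moved after the optional sign.
moved = re.compile(
    r"(?P<sign>[-+]?)\b(?P<hours>2[1-3]|[01][0-9]|[1-9]):(?P<minutes>[0-5][0-9])\b"
)

def sign_of(pattern, text):
    """Return the captured sign, or None when the pattern does not match at all."""
    m = pattern.search(text)
    return m.group("sign") if m else None

print(repr(sign_of(original, "-13:21")))  # → ''   (match starts after the sign)
print(repr(sign_of(moved, "-13:21")))     # → '-'
print(sign_of(moved, "a13:21"))           # → None (digit preceded by a word char)
```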
The word boundary anchor (\b) matches the transition from a word character (letter, digit or underscore) to a non-word character, or vice versa. There is no such transition before the sign in -13:21.
The word boundary anchor can stay between the sign and the hours to avoid matching inside expressions that look similar to a time (65401:23), but you cannot prevent it from matching in 654:01:23 or 654-01:23.
As a side note [\-\+] is just a convoluted way to write [-+]. + does not have any special meaning inside a character class, there is no need to escape it. - is a special character inside a character class but not when it is the first or the last character (i.e. [- or -]).
Another remark: you use both [0-9] and \d in your regex. They denote the same thing [1] but, for readability, it is better to stick to one convention. Since the other character classes in the regex are written as digit ranges, I would use [0-9] rather than \d.
There are also two bugs in the regex fragment for hours: 2[1-3]|[01][0-9]|[1-9] matches neither 0 (although it matches 00) nor 20.
Given all the above corrections and improvements, the regex should be:
(?<sign>[-+]?)\b(?<hours>2[0-3]|[01][0-9]|[0-9]):(?<minutes>[0-5][0-9])\b
[1] \d is the same as [0-9] when the Unicode flag is not set. When Unicode is enabled, \d also matches digits used in non-Latin scripts.
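To see which hour values the two fragments accept, here is a small Python sketch. The names accepted and parse_offset are made up, and the full pattern uses Python's (?P<name>...) spelling for named groups.

```python
import re

buggy_hours = re.compile(r"2[1-3]|[01][0-9]|[1-9]")
fixed_hours = re.compile(r"2[0-3]|[01][0-9]|[0-9]")

def accepted(pattern):
    """Hours 0..23 (with and without a leading zero) that the fragment fully matches."""
    candidates = [str(h) for h in range(24)] + [f"{h:02d}" for h in range(10)]
    return {c for c in candidates if pattern.fullmatch(c)}

# The corrected fragment gains exactly the two missing values.
print(sorted(accepted(fixed_hours) - accepted(buggy_hours)))  # → ['0', '20']

corrected = re.compile(
    r"(?P<sign>[-+]?)\b(?P<hours>2[0-3]|[01][0-9]|[0-9]):(?P<minutes>[0-5][0-9])\b"
)

def parse_offset(text):
    """Return (sign, hours, minutes) for the first offset found, or None."""
    m = corrected.search(text)
    return m.group("sign", "hours", "minutes") if m else None

print(parse_offset("-20:05"))  # → ('-', '20', '05')
```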

Regular expression to match alphanumeric, hyphen, underscore and space string

I'm trying to match a string that contains alphanumeric, hyphen, underscore and space.
Hyphen, underscore, space and numbers are optional, but the first and last characters must be letters.
For example, these should all match:
abc
abc def
abc123
ab_cd
ab-cd
I tried this:
^[a-zA-Z0-9-_ ]+$
but it matches with space, underscore or hyphen at the start/end, but it should only allow in between.
Use a simple character class wrapped with letter chars:
^[a-zA-Z]([\w -]*[a-zA-Z])?$
This matches input that starts and ends with a letter, including just a single letter.
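A word of caution about your regex: you have the hyphen in the middle of the character class, where it can be read as a range operator. For example, [+-_] means "every character between + and _ inclusive". In [a-zA-Z0-9-_ ] the hyphen directly follows the range 0-9, so engines such as PCRE treat it as a literal, but the placement is easy to break when the class is edited.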
If you want a literal dash in a character class, put it first or last or escape it.
Also, prefer the use of \w "word character", which is all letters and numbers and the underscore in preference to [a-zA-Z0-9_] - it's easier to type and read.
Check this working in fiddle http://refiddle.com/refiddles/56a07cec75622d3ff7c10000
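A quick Python check of the letter-wrapped pattern, plus a demonstration of why hyphen placement inside a class matters. This is only a sketch: re.ASCII is used so that \w means [a-zA-Z0-9_], and the helper name is_valid is made up. Note that abc123 from the question's examples ends in a digit, so the "first and last characters must be letters" rule rejects it.

```python
import re

# re.ASCII keeps \w equal to [a-zA-Z0-9_].
wrapped = re.compile(r"^[a-zA-Z]([\w -]*[a-zA-Z])?$", re.ASCII)

def is_valid(text):
    """True when text starts and ends with a letter and uses only allowed characters."""
    return wrapped.search(text) is not None

for sample in ["abc", "abc def", "ab_cd", "ab-cd", "a"]:
    print(sample, is_valid(sample))      # all True
for sample in ["-abc", "abc ", "_abc", "abc123"]:
    print(sample, is_valid(sample))      # all False

# A hyphen between two characters is a range: "+" (0x2B) to "_" (0x5F) includes "A".
print(bool(re.fullmatch(r"[+-_]", "A")))  # → True
# With the hyphen last it is a literal, so only "+", "_" and "-" match.
print(bool(re.fullmatch(r"[+_-]", "A")))  # → False
```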
This will fix the issue
^[a-zA-Z]+[a-zA-Z0-9-_ ]*[a-zA-Z0-9]$
I tried using following regex:
/^\w+([\s-_]\w+)*$/
This allows alphanumeric, underscore, space and dash.
More details
As per your requirement of including space, hyphen, underscore and alphanumeric characters, you can use the \w shorthand character set for [a-zA-Z0-9_]. Escape the hyphen as \-, since inside a character set it is normally used to denote a range.
To negate the space and hyphen at the beginning and end I have used [^\s\-].
So complete regex becomes [^\s\-][\w \-]+[^\s\-]
Here is the working demo.
You can use this regex:
^[a-zA-Z0-9]+(?:[\w -]*[a-zA-Z0-9]+)*$
RegEx Demo
This will only allow alphanumerics at start and end.
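For comparison, here is a small Python sketch of this last pattern, which allows digits as well as letters at both ends. The helper name alnum_ends is made up, and re.ASCII is an assumption to keep \w limited to ASCII.

```python
import re

pattern = re.compile(r"^[a-zA-Z0-9]+(?:[\w -]*[a-zA-Z0-9]+)*$", re.ASCII)

def alnum_ends(text):
    """True when text starts and ends with a letter or digit."""
    return pattern.search(text) is not None

print(alnum_ends("abc123"))   # → True  (a trailing digit is accepted here)
print(alnum_ends("ab-cd ef")) # → True
print(alnum_ends("abc_"))     # → False (ends with an underscore)
print(alnum_ends(" abc"))     # → False (starts with a space)
```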

Need help understanding this particular regular expression [^.]

[^.]+\.(txt|html)
I am learning regex, and am trying to parse this.
[^.] The ^ means "not", and the dot is a wildcard that means any character, so this means find a match with "not any character"? I still don't understand this. Can anyone explain?
The plus is a Kleene Plus which means "1 or more". So now it's "one or more" "not any character".
I get \., it means a period.
(txt|html) means match with a txt file or html file. I think I understand everything after the plus sign. What I don't understand is why it doesn't look something like the DOS equivalent, where I can just do this: *.txt or *.(txt|html), where * means everything that ends in the file extension .txt or .html?
Is [^.] the equivalent of * in DOS?
The dot (.) has no special meaning inside a character class and does not need to be escaped.
[^.] means "any character that is not a literal . character". [^.]+ matches one or more occurrences of any character that is not a dot.
From regular-expressions.info:
In most regex flavors, the only special characters or meta-characters inside a character class are the closing bracket (]), the backslash (\), the caret (^), and the hyphen (-). The usual meta-characters are normal characters inside a character class, and do not need to be escaped by a backslash. Your regex will work fine if you escape the regular metacharacters inside a character class, but doing so significantly reduces readability.
. is not special inside [] character class. [^.]+ means one or more occurrences (+) of any character which is not a dot.
*.txt would not be a valid regex, because the * has no preceding character to repeat (zero or more times).
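Both points can be tried out in a few lines of Python. This is a sketch only: the helper names extension and is_valid_regex are made up, and fullmatch is used here to test whole file names, which the original unanchored pattern does not itself require.

```python
import re

pattern = re.compile(r"[^.]+\.(txt|html)")

def extension(filename):
    """Return 'txt' or 'html' when the whole name is <non-dots>.<ext>, else None."""
    m = pattern.fullmatch(filename)
    return m.group(1) if m else None

def is_valid_regex(source):
    """True when Python's re module can compile the pattern."""
    try:
        re.compile(source)
        return True
    except re.error:
        return False

print(extension("notes.txt"))        # → txt
print(extension("index.html"))       # → html
print(extension("archive.tar.txt"))  # → None ([^.]+ cannot cross the first dot)
print(is_valid_regex("*.txt"))       # → False (nothing for * to repeat)
print(is_valid_regex(r".*\.txt"))    # → True  (the regex spelling of *.txt)
```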

Regular expression with a set with a character followed by a character

I'm writing a regular expression in Java for capturing some word without spaces.
The word can contain only letters, numbers, hyphens and dots.
The character set [\w+\-\\.] works well.
Now I want to edit the set to allow a single space after the dot.
How should I edit my regular expression?
You can add an alternation that matches this additional requirement
([\w\-.]|(?<=\.) )+
See it here on Regexr
(?<=\.) is a lookbehind assertion. It ensures that space is only matched, if it is preceded by a dot.
Other hints:
\w contains the underscore and matches per default only ASCII letters/digits. If you care about Unicode, use either the modifier UNICODE_CHARACTER_CLASS to enable Unicode for \w or use the Unicode properties \p{L} and \p{Nd} to match Unicode letters and digits.
You don't need to escape the dot in a character class.
You have \w+ in your character class. Are you aware that this just adds the "+" character to the accepted characters?
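The lookbehind alternation can be exercised as follows. The question is about Java, but the same pattern text works in Python, which is used here as a sketch; re.ASCII mirrors Java's default ASCII-only \w, and the helper name accepts is made up.

```python
import re

# A space is only allowed when the previous character is a dot.
pattern = re.compile(r"([\w\-.]|(?<=\.) )+", re.ASCII)

def accepts(text):
    """True when the whole text is made of allowed characters and dot-preceded spaces."""
    return pattern.fullmatch(text) is not None

print(accepts("foo.bar"))    # → True
print(accepts("foo. bar"))   # → True  (single space right after the dot)
print(accepts("foo bar"))    # → False (space not preceded by a dot)
print(accepts("foo.  bar"))  # → False (second space is preceded by a space)
```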
In case of a dot followed by a space, I suppose this pattern should be neither the first, nor the last in the matched string? You may want to enclose it in word boundaries \b:
([0-9A-Za-z-]|\b\.( \b)?)+
I deliberately did not use \w, to exclude underscores.
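A Python sketch of the word-boundary variant shows what it accepts and rejects. The helper name accepts_bounded is made up. One thing the checks make visible is that a bare trailing dot still matches; only a leading dot, a trailing "dot space", and underscores are rejected.

```python
import re

# \b before the dot requires a word character right before it;
# the optional " \b" requires a word character right after the space.
pattern = re.compile(r"([0-9A-Za-z-]|\b\.( \b)?)+")

def accepts_bounded(text):
    """True when the whole text matches the word-boundary variant."""
    return pattern.fullmatch(text) is not None

print(accepts_bounded("foo. bar"))  # → True
print(accepts_bounded("foo."))      # → True  (trailing dot without a space)
print(accepts_bounded("foo. "))     # → False (space must be followed by a word char)
print(accepts_bounded(".foo"))      # → False (no word char before the dot)
print(accepts_bounded("a_b"))       # → False (underscore deliberately excluded)
```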
For allowing ONLY a single space after the dot you can use this regex:
^(?!.*?\. {2})[\w.-]+$
You don't need to escape dot OR hyphen inside character class
(?!.*?\. {2}) is a negative lookahead that disallows 2 or more spaces after a dot

Why was \b introduced while \s match string borders in regular expressions?

I see that there is the \b which I have never used and I was wondering if someone can give me use cases when it is not possible to do without \b.
The expression \b is just a convenient shorthand for what you can already do with other constructs.
For example, if your regular expression engine has lookarounds then \b is equivalent to the following longer expression:
(?<=\w)(?!\w)|(?<!\w)(?=\w)
Similarly \w, \d, etc. are just shorthand for what already can be done using character classes, for example [A-Za-z0-9_] or [0-9]. You typically want to use the short version because writing out the full definition each time is cumbersome, harder to read and increases the risk of making an error.
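The claimed equivalence between \b and the lookaround expression can be checked by comparing the positions at which each one matches. A minimal Python sketch (the helper name positions is made up; re.ASCII keeps \w to [A-Za-z0-9_]):

```python
import re

boundary = re.compile(r"\b", re.ASCII)
lookaround = re.compile(r"(?<=\w)(?!\w)|(?<!\w)(?=\w)", re.ASCII)

def positions(pattern, text):
    """Offsets in text at which the (zero-width) pattern matches."""
    return [m.start() for m in pattern.finditer(text)]

for sample in ["hello.hi", "-13:21", ""]:
    print(positions(boundary, sample) == positions(lookaround, sample))  # → True each time

print(positions(boundary, "hello.hi"))  # → [0, 5, 6, 8]
```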
They match on different things - \s matches on whitespace, \b on word boundaries.
One good example is the character ".".
In the string hello.hi:
\s will not match ., but \b will match before and after it.
They are completely different things.
\s is a "whitespace character". That means it is a shortcut to a predefined character class that contains whitespace characters like \t, \r, \n or a space. \s matches one out of those characters.
\b is a "word boundary". It is a zero-width assertion and is related to the predefined character class \w. Zero-width means that it does not match a character; instead it matches a position that fulfills an assertion. The assertion here is a word character on one side and a non-word character on the other side. Mark already provided the long version of \b, and Oded an example where \b would match.
\w is a "word character", meaning it matches something like [a-zA-Z0-9_]. In some languages it is based on Unicode and matches all letters.
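The difference between consuming a character (\s) and matching a position (\b) shows up in the match spans. A short Python sketch, with the made-up helper name spans:

```python
import re

def spans(pattern, text):
    """(start, end) of every match; start == end means a zero-width match."""
    return [m.span() for m in re.finditer(pattern, text)]

# \s consumes one whitespace character, so each span has length 1.
print(spans(r"\s", "hello hi"))  # → [(5, 6)]
# There is no whitespace in hello.hi, so \s finds nothing there.
print(spans(r"\s", "hello.hi"))  # → []
# \b matches positions only: before h, after o, before the second h, at the end.
print(spans(r"\b", "hello.hi"))  # → [(0, 0), (5, 5), (6, 6), (8, 8)]
```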