I have this htaccess code
RewriteRule ^/([uge])/([^/]+)$ /$1/$2/
But I can't work out what [^/]+ does.
I've been searching for this on Google for a while, but I couldn't find what I wanted.
You have two basic regex constructs here
Character class
See character classes on regular-expressions.info
[...] is a character class: the construct matches exactly one character from the class (that is, from inside the square brackets).
Your class starts with a ^, which gives the character class a special meaning: it is a negated character class ([^...]), which matches any one character that is not part of the class.
Quantifier
See quantifiers on regular-expressions.info
+ is a quantifier meaning 1 or more of the preceding element.
Meaning of your regex
To understand what this is doing, you also have to take the next thing into account: the $ at the end. This is an anchor that matches the end of the string.
See anchors on regular-expressions.info
So ([^/]+)$ matches the run of characters at the end of the string that are not slashes.
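A quick way to see this is to try the fragment in Python's re module. This is only a sketch: the helper name tail and the sample paths are made up for illustration, and Apache's own engine is not involved.

```python
import re

# Fragment from the question: one or more non-slash characters,
# anchored at the end of the string.
last_segment = re.compile(r"([^/]+)$")

def tail(path):
    """Return the trailing run of non-slash characters, or None if there is none."""
    m = last_segment.search(path)
    return m.group(1) if m else None

print(tail("/u/alice"))   # alice
print(tail("/u/alice/"))  # None, because the string ends with a slash
```

With a trailing slash there is no non-slash character directly before the end of the string, so nothing matches.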
Here you can also find a basic tutorial
[^/] means any character not matching /.
That means:
Match 1 or more characters until forward slash / is found
Anything in square brackets [ and ] that has a caret ^ at the start acts as a negation, hence:
[^/] means any character except /
[^/]+ means 1 or more characters except /
[any_character] is a character class (also called a character set); see the charclass Ref. [^any_character] is a negated character class; see the charclass negated Ref.
From Anchors Ref:
Remember ^ also has the meaning: The caret ^ matches the position before the first character in the string (an Anchor) when not used inside a Character class.
From charclass Ref: Metacharacters Inside Character Classes:
Note that the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^) and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash. To search for a star or plus, use [+*]. Your regex will work fine if you escape the regular metacharacters inside a character class, but doing so significantly reduces readability.
From Repetition Ref:
+ means one or more of the preceding item.
So, [^/]+
means match one or more characters other than /. So, it will match until a / is encountered.
For ^/([uge])/([^/]+)$
the string should begin with /
followed by character u or g or e
followed by /
followed by one or more characters other than /, running to the end of the string
The () (round brackets) create a backreference, i.e. a capturing group: see Round Brackets Create a Backreference Ref
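The breakdown above can be checked with a small Python sketch. It only imitates the matching and the $1/$2 substitution with Python's re module; the function name rewrite and the sample paths are assumptions for illustration, and it says nothing about how Apache itself prepares the URL before matching.

```python
import re

# Same pattern as the RewriteRule, with two capturing groups.
rule = re.compile(r"^/([uge])/([^/]+)$")

def rewrite(path):
    """Append a trailing slash when the path matches, else return it unchanged."""
    m = rule.match(path)
    if m is None:
        return path
    # group(1) and group(2) play the role of $1 and $2 in the rule.
    return "/{}/{}/".format(m.group(1), m.group(2))

print(rewrite("/u/alice"))        # /u/alice/
print(rewrite("/x/alice"))        # /x/alice (x is not one of u, g, e)
print(rewrite("/u/alice/extra"))  # /u/alice/extra ([^/]+ cannot cross a slash)
```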
The expression [^/] matches any character that is not a /, and the quantifier + requires the expression to its left to appear at least once.
Related
[^.]+\.(txt|html)
I am learning regex, and am trying to parse this.
[^.] The ^ means "not", and the dot is a wildcard that means any character, so this means find a match with "not any character"? I still don't understand this. Can anyone explain?
The plus is a Kleene Plus which means "1 or more". So now it's "one or more" "not any character".
I get \., it means a period.
(txt|html) means match with a txt file or html file. I think I understand everything after the plus sign. What I don't understand is why it doesn't look something like the DOS equivalent, where I can just do this: *.txt or *.(txt|html), where * means everything that ends in the file extension .txt or .html?
Is [^.] the equivalent of * in DOS?
The dot (.) has no special meaning when it's inside a character class, and doesn't need to be escaped.
[^.] means "any character that is not a literal . character". [^.]+ matches one or more occurrences of any character that is not a dot.
From regular-expressions.info:
In most regex flavors, the only special characters or meta-characters inside a character class are the closing bracket (]), the backslash (\), the caret (^), and the hyphen (-). The usual meta-characters are normal characters inside a character class, and do not need to be escaped by a backslash. Your regex will work fine if you escape the regular metacharacters inside a character class, but doing so significantly reduces readability.
. is not special inside [] character class. [^.]+ means one or more occurrences (+) of any character which is not a dot.
If you wrote *.txt, it would not be a valid regex, because the * would have nothing before it to repeat (zero or more times).
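To compare the two notations side by side, here is a small Python sketch. The file names and helper names are invented for the example; fnmatch is used as a stand-in for DOS-style wildcards, which is an assumption about what the asker had in mind.

```python
import re
from fnmatch import fnmatch

pattern = re.compile(r"[^.]+\.(txt|html)")

def is_text_or_html(name):
    """True when the whole name is non-dot characters, a dot, then txt or html."""
    return pattern.fullmatch(name) is not None

def is_valid_regex(text):
    """True when Python's re module accepts text as a pattern."""
    try:
        re.compile(text)
        return True
    except re.error:
        return False

print(is_text_or_html("notes.txt"))        # True
print(is_text_or_html("archive.tar.txt"))  # False, [^.]+ cannot cross the first dot
print(fnmatch("notes.txt", "*.txt"))       # True, * is a wildcard in glob syntax
print(is_valid_regex("*.txt"))             # False, * has nothing to repeat
```

Note that fullmatch is used on purpose: a plain search would still find "tar.txt" inside "archive.tar.txt".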
I was trying to understand about validating email in the following link -
http://www.w3schools.com/PHP/php_form_url_email.asp
I know that \w means alphanumeric characters i.e. [0-9a-zA-Z] and - should mean to include a "-" as well. I got confused because they have used it after the "." as well, I think that after "." only alphanumeric characters can appear such as "com" , "org" etc.
Regex 101
\w explained
\w match any word character [a-zA-Z0-9_]
\w\- explained
\w\-
\w match any word character [a-zA-Z0-9_]
\- matches the character - literally
Matching Email Addresses Simple, not future proof
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}\b
\w means [a-zA-Z0-9_]
and
\- means - (literal) in a character class.
Thus [\w\-] means [a-zA-Z0-9_-]
Note that escaping - in a character class is unnecessary if it is at the first or last position.
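If you want to check what [\w\-] accepts, a few lines of Python will do. This is a sketch in Python's re module rather than PHP's preg functions, and the re.ASCII flag is there because Python's \w otherwise also matches non-ASCII letters and digits; the helper name accepted is made up.

```python
import re

# With re.ASCII, \w is exactly [a-zA-Z0-9_].
word_or_hyphen = re.compile(r"[\w\-]", re.ASCII)
# The same class with the hyphen left unescaped in last position.
unescaped_hyphen = re.compile(r"[\w-]", re.ASCII)

def accepted(chars, pattern=word_or_hyphen):
    """Return the characters of chars that the class matches."""
    return [c for c in chars if pattern.fullmatch(c)]

print(accepted("a_Z9-.@ "))  # ['a', '_', 'Z', '9', '-']
```

The underscore and the hyphen get through, while the dot, the @ and the space do not.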
A dot . in a regular expression matches any single character. In order for regex to match a dot, the dot has to be escaped: \.
It has been pointed out to me that inside square brackets [] a dot does not have to be escaped. For example, the expression:
[.]{3} would match ... string.
Doesn't it, really? And if so, is it true for all regex standards?
In a character class (square brackets) any character except ^, -, ] or \ is a literal.
This website is a brilliant reference and has lots of info on the nuances of different regex flavours.
http://www.regular-expressions.info/refcharclass.html
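For what it's worth, this is easy to confirm in at least one flavour. The sketch below uses Python's re module only, so it shows nothing about other engines.

```python
import re

bracketed = re.compile(r"[.]{3}")  # dot inside a character class
escaped = re.compile(r"\.{3}")     # dot escaped with a backslash
bare = re.compile(r".{3}")         # unescaped dot, any character

print(bool(bracketed.fullmatch("...")))  # True
print(bool(bracketed.fullmatch("abc")))  # False, only literal dots match
print(bool(escaped.fullmatch("abc")))    # False, same behaviour as [.]
print(bool(bare.fullmatch("abc")))       # True, . outside a class is a wildcard
```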
The Regular expression
/[\D\S]/
should match characters that are not digits or not whitespace.
But when I test this expression in RegexPal,
it matches every character, including digits and whitespace.
What am I doing wrong?
\D = all characters except digits,
\S = all characters except whitespaces
[\D\S] = union (set theory) of the above character groups = all characters.
Why? Because \D contains \s and \S contains \d.
If you want to match characters that are neither digits nor whitespace, you can use [^\d\s].
Your regex cancels itself out. Putting items inside [] means the match has to be one of those items. The two classes cover each other's gaps, so together they match everything: \D (any non-digit) matches every character except digits, and \S (any non-whitespace) matches digits as well as every other non-whitespace character.
You can try using [^\d\s] which says to negate the match of any digit or any space. Instead of having everything caught in the original regex, this negates the matching of both the \d and \s. You can see testing done with it here.
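Here is a small check of both classes in Python, as a sketch; the sample string is made up and contains a letter, digits, a space, a tab and some punctuation.

```python
import re

sample = "a1 _\t9-"

# Union of "not a digit" and "not whitespace": every character qualifies.
everything = re.findall(r"[\D\S]", sample)
# Negated class: characters that are neither digits nor whitespace.
neither = re.findall(r"[^\d\s]", sample)

print("".join(everything) == sample)  # True
print(neither)                        # ['a', '_', '-']
```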
/\ATo\:\s+(.*)/
Also, how do you work it out, what's the approach?
In multi-line regular expressions, \A matches the start of the string (and \Z is end of string, while ^/$ matches the start/end of the string or the start/end of a line). In single line variants, you just use ^ and $ for start and end of string/line since there is no distinction.
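The difference between \A and ^ only shows up in multi-line mode. A minimal sketch with Python's re module (the sample text is invented, and other flavours may spell the flag differently):

```python
import re

text = "Re: hello\nTo: paxdiablo"

# In multi-line mode, ^ also matches right after each newline.
caret_multiline = re.findall(r"^To:", text, re.MULTILINE)
# \A still matches only at the very start of the string.
a_multiline = re.findall(r"\ATo:", text, re.MULTILINE)
# Without the flag, ^ behaves like \A.
caret_default = re.findall(r"^To:", text)

print(caret_multiline)  # ['To:']
print(a_multiline)      # []
print(caret_default)    # []
```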
To is literal, \: is an escaped :.
\s means whitespace and the + means one or more of the preceding "characters" (white space in this case).
() is a capturing group, meaning everything in here will be stored in a "register" that you can use. Hence, this is the meat that will be extracted.
.* simply means any non newline character ., zero or more times *.
So, what this regex will do is process a string like:
To: paxdiablo
Re: you are so cool!
and return the text paxdiablo.
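You can reproduce this in Python as a quick sketch. The surrounding / delimiters are dropped because Python's re module does not use them; the variable names are only for illustration.

```python
import re

header = re.compile(r"\ATo\:\s+(.*)")

message = "To: paxdiablo\nRe: you are so cool!"

m = header.match(message)
# group(1) is the text captured by (.*); the dot stops at the newline.
recipient = m.group(1) if m else None

print(recipient)  # paxdiablo
```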
As to how to learn how to work this out yourself, the Perl regex tutorial(a) is a good start, and then practise, practise, practise :-)
(a) You haven't actually stated which regex implementation you're using but most modern ones are very similar to Perl. If you can find a specific tutorial for your particular flavour, that would obviously be better.
\A is a zero-width assertion and means "Match only at beginning of string".
The regex reads: On a line beginning with "To:" followed by one or more whitespaces (\s), capture the remainder of the line ((.*)).
First, you need to know what the different anchors, character classes and quantifiers are. Backslash-prefixed tokens are anchors or shorthand character classes, for instance \A (an anchor) and \s (a character class) from your regex. Quantifiers are for instance the +. There are several references on the internet, for instance this one.
Using that, we can see what happens by going left to right:
\A matches a beginning of the string.
To matches the text "To" literally
\: escapes the ":"; a colon has no special meaning here anyway, so this is "just a colon"
\s matches whitespace (space, tab, etc)
+ means to match the previous class one or more times, so \s+ means one or more whitespace characters
() is a capture group, anything matched within the parens is saved for later use
. means "any character" (by default, any character except a newline)
* is like the +, but zero or more times, so .* means any number of any characters
Taking that together, the regex will match a string beginning with "To:", then at least one whitespace character, and then anything, which it will save. So, with the string "To: JaneKealum", you'll be able to extract "JaneKealum".
You start from the left and look for any escaped (i.e. \A) characters. The rest are normal characters. \A means the start of the input. So To: must be matched at the very beginning of the input. I think the : is escaped for nothing. \s is a character group for all whitespace (tabs, spaces, possibly newlines) and the + that follows it means you must have one or more whitespace characters. After that you capture all the rest of the line in a group (marked with ( )).
If the input was
To: progo@home
the capture group would contain "progo@home"
It matches To: at the beginning of the input, followed by at least one whitespace, followed by any number of characters as a group.
The initial and trailing / characters delimit the regular expression.
A \ inside the expression means to treat the following character specially or treat it as a literal if it normally has a special meaning.
The \A means match only at the beginning of a string.
To means match the literal "To"
\: means match a literal ':'. A colon is normally a literal with no special meaning, so the backslash is not needed.
\s means match a whitespace character.
+ means match as many as possible but at least one of whatever it follows, so \s+ means match one or more whitespace characters.
The ( and ) define a group of characters that will be captured and returned by the expression evaluator.
And finally the . matches any character (except a newline, by default) and the * means match as many as possible but can be zero. Therefore the (.*) will capture all characters to the end of the line.
So the pattern will match a string that starts with "To:" followed by whitespace, and capture everything from the first non-whitespace character onward.
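Going through it one bit at a time is easier with a few edge cases to try. The sketch below uses Python's re module, and the helper name capture and the inputs are made up for illustration.

```python
import re

header = re.compile(r"\ATo\:\s+(.*)")

def capture(text):
    """Return what (.*) captured, or None when the pattern does not match."""
    m = header.match(text)
    return m.group(1) if m else None

print(repr(capture("To:    Jane  ")))  # 'Jane  ', \s+ eats all leading spaces
print(repr(capture("To:Jane")))        # None, \s+ needs at least one whitespace
print(repr(capture("Cc: To: Jane")))   # None, \A pins the match to the start
print(repr(capture("To: ")))           # '', .* is allowed to match nothing
print(repr(capture("To: \nJane")))     # 'Jane', \s also matches the newline
```

The last case is worth noticing: because \s includes the newline, the whitespace run can spill onto the next line before the capture starts.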
The only way to really understand these things is to go through them one bit at a time and check the meaning of each component.