What does this regular expression mean? - regex

/\ATo\:\s+(.*)/
Also, how do you work it out, what's the approach?

In multi-line regular expressions, \A matches the start of the string (and \Z is end of string, while ^/$ matches the start/end of the string or the start/end of a line). In single line variants, you just use ^ and $ for start and end of string/line since there is no distinction.
To is literal, \: is an escaped :.
\s means whitespace and the + means one or more of the preceding "characters" (white space in this case).
() is a capturing group, meaning everything in here will be stored in a "register" that you can use. Hence, this is the meat that will be extracted.
.* simply means any non newline character ., zero or more times *.
So, what this regex will do is process a string like:
To: paxdiablo
Re: you are so cool!
and return the text paxdiablo.
As to how to learn how to work this out yourself, the Perl regex tutorial(a) is a good start, and then practise, practise, practise :-)
(a) You haven't actually stated which regex implementation you're using but most modern ones are very similar to Perl. If you can find a specific tutorial for your particular flavour, that would obviously be better.
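For what it's worth, here is a minimal sketch of the example above in Python, assuming Python's re module behaves like your flavour here (it supports \A and treats \: as a plain colon). The names TO_HEADER and extract_recipient are made up for illustration.

```python
import re

# The same pattern as in the question, without the / delimiters.
TO_HEADER = re.compile(r"\ATo\:\s+(.*)")

def extract_recipient(message):
    """Return the text captured after 'To:', or None if the input does not start with it."""
    match = TO_HEADER.match(message)
    return match.group(1) if match else None

message = "To: paxdiablo\nRe: you are so cool!"
print(extract_recipient(message))                    # the captured group
print(extract_recipient("Re: hi\nTo: paxdiablo"))    # no match: "To:" is not at the start
```

The capture stops at the end of the first line because, by default, . does not match a newline.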

\A is a zero-width assertion and means "Match only at beginning of string".
The regex reads: On a line beginning with "To:" followed by one or more whitespaces (\s), capture the remainder of the line ((.*)).
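To see why \A is stricter than ^, here is a small sketch in Python (an assumption, since the question does not say which flavour is in use). With the multi-line flag on, ^ can match at the start of any line, while \A still only matches at the very start of the string.

```python
import re

text = "Subject: hi\nTo: second line"

# With re.MULTILINE, ^ also matches right after each newline.
caret = re.search(r"^To:\s+(.*)", text, re.MULTILINE)

# \A only ever matches at the very start of the whole string.
anchor_a = re.search(r"\ATo:\s+(.*)", text, re.MULTILINE)

print(caret.group(1) if caret else None)
print(anchor_a.group(1) if anchor_a else None)
```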

First, you need to know what the different escape sequences and quantifiers are. The backslash-prefixed sequences are either character classes, such as \s from your regex, or anchors, such as \A. Quantifiers are, for instance, the +. There are several references on the internet, for instance this one.
Using that, we can see what happens by going left to right:
\A matches a beginning of the string.
To matches the text "To" literally
\: escapes the ":"; a colon has no special meaning here anyway, so either way it is "just a colon"
\s matches whitespace (space, tab, etc)
+ means to match the previous class one or more times, so \s+ means one or more whitespace characters
() is a capture group, anything matched within the parens is saved for later use
. means "any character"
* is like the +, but zero or more times, so .* means any number of any characters
Taking that together, the regex will match a string beginning with "To:", then at least one whitespace character, and then anything, which it will save. So, with the string "To: JaneKealum", you'll be able to extract "JaneKealum".

You start from left and look for any escaped (ie \A) characters. The rest are normal characters. \A means the start of the input. So To: must be matched at the very beginning of the input. I think the : is escaped for nothing. \s is a character group for all spaces (tabs, spaces, possibly newlines) and the + that follows it means you must have one or more space characters. After that you capture all the rest of the line in a group (marked with ( )).
If the input was
To: progo#home
the capture group would contain "progo#home"

It matches To: at the beginning of the input, followed by at least one whitespace, followed by any number of characters as a group.

The initial and trailing / characters delimit the regular expression.
A \ inside the expression means to treat the following character specially or treat it as a literal if it normally has a special meaning.
The \A means match only at the beginning of a string.
To means match the literal "To"
\: means match a literal ':'. A colon is normally a literal with no special meaning of its own, so the backslash is not actually needed here.
\s means match a whitespace character.
+ means match as many as possible but at least one of whatever it follows, so \s+ means match one or more whitespace characters.
The ( and ) define a group of characters that will be captured and returned by the expression evaluator.
And finally the . matches any character and the * means match as many as possible but can be zero. Therefore the (.*) will capture all characters to the end of the line (or to the end of the input string if . is set to match newlines as well).
So therefore the pattern will match a string that starts with "To:" followed by whitespace, and capture everything from the first non-whitespace character after it.
The only way to really understand these things is to go through them one bit at a time and check the meaning of each component.
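One way to go through it a bit at a time is to write the pattern in a "verbose" form with a comment per component. This is only a sketch, assuming Python's re.VERBOSE flag (other flavours often have a similar /x modifier); the name ANNOTATED is made up.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments.
ANNOTATED = re.compile(r"""
    \A      # start of the input
    To      # the literal text To
    \:      # a literal colon (the backslash is optional)
    \s+     # one or more whitespace characters
    (.*)    # capture the rest of the line
""", re.VERBOSE)

match = ANNOTATED.match("To: JaneKealum")
print(match.group(1))
```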

Related

Simple Regex: match everything until the last dot

Just want to match every character up to but not including the last period
dog.jpg -> dog
abc123.jpg.jpg -> abc123.jpg
I have tried
(.+?)\.[^\.]+$
Use lookahead to assert the last dot character:
.*(?=\.)
This will do the trick
(.*)\.
The first captured group contains the name. You can access it as $1 or \1 as per your language
Regular expressions are greedy by default. This means that when a regex pattern is capable of matching more characters, it will match more characters.
This is a good thing, in your case. All you need to do is match characters and then a dot:
.*\.
That is,
. # Match "any" character
* # Do the previous thing (.) zero OR MORE times (any number of times)
\ # Escape the next character - treat it as a plain old character
. # Escaped, just means "a dot".
So: being greedy by default, match any character AS MANY TIMES AS YOU CAN (because greedy) and then a literal dot.
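A quick way to check the greedy behaviour is a small sketch like this one, assuming Python's re module and using the capturing form (.*)\. so the name can be read from group 1. The helper name strip_last_extension is made up for illustration.

```python
import re

def strip_last_extension(filename):
    """Return everything before the last dot, or the input unchanged if there is no dot."""
    # .* is greedy, so it runs to the end and backtracks only as far as the LAST dot.
    match = re.match(r"(.*)\.", filename)
    return match.group(1) if match else filename

for name in ("dog.jpg", "abc123.jpg.jpg", "README"):
    print(name, "->", strip_last_extension(name))
```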

Unexpected working of Negated Shorthand Character Classes

The Regular expression
/[\D\S]/
should match characters which are not a digit or not whitespace.
But when I test this expression in regexpal,
it matches any character, including digits and whitespace.
What am I doing wrong?
\D = all characters except digits,
\S = all characters except whitespaces
[\D\S] = union (set theory) of the above character groups = all characters.
Why? Because \D contains \s and \S contains \d.
If you want to match characters which are neither digits nor whitespace you can use [^\d\s].
Your regex cancels itself out. Putting the items inside [] means the regex has to match any one of them. The two items here cover each other's gaps, so together they end up matching everything: \D (any non-digit) matches every character except digits, and \S (any non-whitespace) matches every digit as well as every other non-whitespace character.
You can try using [^\d\s] which says to negate the match of any digit or any space. Instead of having everything caught in the original regex, this negates the matching of both the \d and \s. You can see testing done with it here.
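If you want to convince yourself, a short sketch (assuming Python's re module, where \D, \S, \d and \s have the same meaning as in JavaScript for this ASCII sample) shows the difference between the two character classes:

```python
import re

# A letter, a digit, a space, an underscore, a tab and a hyphen.
sample = "a1 _\t-"

# Union of "not a digit" and "not whitespace": every character qualifies.
union = re.findall(r"[\D\S]", sample)

# Negated class: neither a digit nor whitespace.
neither = re.findall(r"[^\d\s]", sample)

print(union)
print(neither)
```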

Regex not stopping at first space

Trying to create a pattern that matches an opening bracket and gets everything between it and the next space it encounters.
I thought \[.*\s would achieve that, but it gets everything from the first opening bracket on. How can I tell it to break at the next space?
\[[^\s]*\s
The .* is greedy, and will eat everything, including spaces, until the last whitespace character. If you replace it with \S* or [^\s]*, it will match only a chunk of zero or more characters other than whitespace.
Escaping the opening bracket might be needed. If you negate the \s with ^\s, the expression should eat everything except spaces, and then a space, which means up to the first space.
You could use a reluctant quantifier:
\[.*?\s
Or instead match on all non-space characters:
\[\S*\s
Use this:
\[[^ ]*
This matches the opening bracket (\[) and then everything except space ([^ ]) zero or more times (*).
I suggest using \[\S*(?=\s).
\[: Match a [ character.
\S*: Match 0 or more non-space characters.
(?=\s): Match a space character, but don't include it in the match. This feature is called a zero-width positive look-ahead assertion and makes sure your pattern only matches if it is followed by a space, so it won't match at the end of line.
You might get away with \[\S*\s if you don't care about groups and want to include the final space, but you would have to clarify exactly which patterns need matching and which should not.
You want to replace . with [^\s], this would match "not space" instead of "anything" that . implies
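To compare the suggestions side by side, here is a rough sketch assuming Python's re module; the sample text is made up for illustration.

```python
import re

text = "see [tag1 and [tag2 for details"

# Greedy .* runs on to the LAST whitespace character it can reach.
greedy = re.search(r"\[.*\s", text).group(0)

# \S* cannot cross a space, so the match ends at the first one.
non_space = re.search(r"\[\S*\s", text).group(0)

# A reluctant .*? stops at the first whitespace as well.
lazy = re.search(r"\[.*?\s", text).group(0)

# The look-ahead requires the space but leaves it out of the match.
lookahead = re.search(r"\[\S*(?=\s)", text).group(0)

for label, value in (("greedy", greedy), ("non_space", non_space),
                     ("lazy", lazy), ("lookahead", lookahead)):
    print(label, repr(value))
```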

Interpreting this piece of JavaScript

I'd like to understand what this line of JavaScript means...
(/^\w+, ?\w+, ?\w\.?$/)
I understand \w stands for 'word', but I need your help in understanding '/', '^', '+', '?', '.?$/'
Thank you..
That's a regular expression, not HTML.
It's inside of a regex literal (/.../) in Javascript.
^ matches the beginning of the string
\w matches any word character
+ matches one or more of the previous set.
? matches zero or one of the previous set (in this case a single space)
\. matches a .. (An unescaped . matches any single character)
$ matches the end of the string.
Let's break it down, because then it is easier to read:
^ beginning of the line
\w+ 1 or more 'word' characters
, a comma
(space)? an optional space
\w+ 1 or more 'word' characters
, a comma
(space)? an optional space
\w a single 'word' character
\.? an optional period
$ end of line
The meaning of a 'word' character is an alpha-numeric character or an underscore.
It is not HTML code but Regular Expression. Read more about it:
Regular expression
In computing, regular expressions,
also referred to as regex or regexp,
provide a concise and flexible means
for matching strings of text, such as
particular characters, words, or
patterns of characters. A regular
expression is written in a formal
language that can be interpreted by a
regular expression processor, a
program that either serves as a parser
generator or examines text and
identifies parts that match the
provided specification.
/^\w+, ?\w+, ?\w\.?$/
Outside in...
/ / delimiters
^ $ Matches the whole string (^ means to match the beginning, $ means to match the end)
One by one...
\w means word character (simply w doesn't match anything but the ASCII character w)
\w+ word characters (at least one, matches as much as possible)
? means the spaces are optional, matches 0 or 1 space character
. matches any character that is not a line break (can be configured with regex modifiers)
\. (like in the example) matches exactly one dot
It's a regular expression that looks for a string of word characters (like letters, digits, or underscores) that has two commas in it with an optional single space after each comma.
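As a rough check of the breakdown, here is a sketch that tries the same pattern against a few made-up inputs. It uses Python rather than JavaScript, which is an assumption: re.ASCII is passed so that \w means [A-Za-z0-9_] as it does in JavaScript, and note that Python's $ also matches just before a trailing newline.

```python
import re

# Same pattern as the JavaScript literal, without the / delimiters.
PATTERN = re.compile(r"^\w+, ?\w+, ?\w\.?$", re.ASCII)

candidates = (
    "Smith, John, Q.",    # two commas, single spaces, trailing period
    "Smith,John,Q",       # the spaces and the period are optional
    "Smith, John",        # only one comma
    "Smith,  John, Q.",   # two spaces after the first comma
)

for candidate in candidates:
    print(repr(candidate), bool(PATTERN.search(candidate)))
```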

Understanding regex criteria in pattern match

I am trying to determine what the following pattern match criteria allows me to enter:
\s*([\w\.-]+)\s*=\s*('[^']*'|"[^"]*"|[^\s]+)
From my attempt to decipher (by looking at the regexes I do understand) it seems to say I can start with any character sequence then I must have a brace followed by alphanumerics, then another sequence followed by braces, one initial single quote, no backslashes closed by a brace ???
Sorry if I have got this completely muddled. Any help is appreciated.
Regards,
Pablo
The square brackets are character classes, and the parens are for grouping. I'm not sure what you mean by "braces".
This basically matches a name=value pair where the name consists of one or more "word", dot or hyphen characters, and the value is either a single-quoted string of characters, a double-quoted string of characters, or a bunch of non-whitespace characters. Single-quoted strings cannot contain a single quote, and double-quoted strings may not contain double quotes (both arguably minor flaws, whatever syntax this is from). There's also arguably some ambiguity since the last option ("a bunch of non-whitespace characters") could match something starting with a single or double quote.
Also, zero or more whitespaces may appear around the equal sign or at the beginning (that's the \s* bits).
It's looking for strings of text which are basically
<identifier> = <value>
identifier is made up of letters, digits, '-' and '.'
value can be a single-quoted string, a double-quoted string, or any other sequence of characters (as long as it doesn't contain whitespace).
So it would match lines that look like this:
foo = 1234
bar-bar= "a double-quoted string"
bar.foo-bar ='a single quoted string'
.baz =stackoverflow.com this part is ignored
Some things to note:
There's no way to put a quote inside a quoted string (such as using \" inside "...").
Anything after the quoted string is ignored.
If a quoted string isn't used for value, then everything from the first space onwards is ignored.
Whitespace is optional
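The example lines above can be tried out with a short sketch like the following, assuming Python's re module (the question does not say which flavour the pattern comes from). The names PAIR and parse_pair are made up for illustration.

```python
import re

# The pattern from the question, unchanged.
PAIR = re.compile(r"""\s*([\w\.-]+)\s*=\s*('[^']*'|"[^"]*"|[^\s]+)""")

def parse_pair(line):
    """Return (identifier, value) for the first pair at the start of line, or None."""
    match = PAIR.match(line)
    return match.groups() if match else None

lines = [
    "foo = 1234",
    'bar-bar= "a double-quoted string"',
    "bar.foo-bar ='a single quoted string'",
    ".baz =stackoverflow.com this part is ignored",
]

for line in lines:
    print(parse_pair(line))
```

Note that the quotes themselves are part of the captured value, since they sit inside the second group.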
RegexBuddy says:
\s*([\w\.-]+)\s*=\s*('[^']*'|"[^"]*"|[^\s]+)
Options: case insensitive
Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.) «\s*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the regular expression below and capture its match into backreference number 1 «([\w\.-]+)»
Match a single character present in the list below «[\w\.-]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
A word character (letters, digits, etc.) «\w»
A . character «\.»
The character “-” «-»
Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.) «\s*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “=” literally «=»
Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.) «\s*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the regular expression below and capture its match into backreference number 2 «('[^']*'|"[^"]*"|[^\s]+)»
Match either the regular expression below (attempting the next alternative only if this one fails) «'[^']*'»
Match the character “'” literally «'»
Match any character that is NOT a “'” «[^']*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “'” literally «'»
Or match regular expression number 2 below (attempting the next alternative only if this one fails) «"[^"]*"»
Match the character “"” literally «"»
Match any character that is NOT a “"” «[^"]*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “"” literally «"»
Or match regular expression number 3 below (the entire group fails if this one fails to match) «[^\s]+»
Match a single character that is a “non-whitespace character” «[^\s]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Created with RegexBuddy
Let us break \s*([\w\.-]+)\s*=\s*('[^']*'|\"[^\"]*\"|[^\s]+) apart:
\s*([\w\.-]+)\s*:
\s* means 0 or more whitespace characters
[\w.-]+ means 1 or more of the following characters: A-Za-z0-9_.-
('[^']*'|\"[^\"]*\"|[^\s]+):
Zero or more non-' characters enclosed in ' and '.
Zero or more non-" characters enclosed in " and ".
One or more non-whitespace characters
So basically, you can mostly ignore the \s*'s in trying to understand the expression, they just handle removing spacing.
Yes, you have got it completely muddled. :P For one thing, there are no braces in that regex; that word usually refers to the curly brackets: {}. That regex only contains square brackets and parentheses (aka round brackets), and they're all regex metacharacters--they aren't meant to match those characters literally. The same goes for most of the other characters.
You might find this site useful. Very good tutorial and reference site for all things regex.