Unexpected working of Negated Shorthand Character Classes - regex

The Regular expression
/[\D\S]/
should match characters Which is not a digit or not whitespace
But When I test this expression in regexpal
It starts matching any character that's digit, whitespace
What i am doing wrong ?

\D = all characters except digits,
\S = all characters except whitespaces
[\D\S] = union (set theory) of the above character groups = all characters.
Why? Because \D contains \s and \S contains \d.
If you want to match characters which are not dights nor whitespaces you can use [^\d\s].

Your regex is invalidating itself as it goes. Putting the regex inside of [] means it has to match one of the items inside of it. These two items override each other, which end up matching everything. In theory, anything that is non digit, would match every other char. available, and any non whitespace matches any digit and any other char. as well.
You can try using [^\d\s] which says to negate the match of any digit or any space. Instead of having everything caught in the original regex, this negates the matching of both the \d and \s. You can see testing done with it here.

Related

Python regex match certain floating point numbers

I'm trying to match: 0 or more numbers followed by a dot followed by ( (0 or more numbers) but not (if followed by a d,D, or _))
Some examples and what should match/not:
match:
['1.0','1.','0.1','.1','1.2345']
not match:
['1d2','1.2d3','1._dp','1.0_dp','1.123165d0','1.132_dp','1D5','1.2356D6']
Currently i have:
"([0-9]*\.)([0-9]*(?!(d|D|_)))"
Which correctly matches everything in the match list. But for those in the things it should not match it incorrectly matches on:
['1.2d3','1.0_dp','1.123165d0','1.132_dp','1.2356D6']
and correctly does not match on:
['1d2','1._dp','1D5']
So it appears i have problem with the ([0-9]*(?!(d|D|_)) part which is trying to not match if there is a d|D|_ after the dot (with zero or more numbers in-between). Any suggestions?
Instead of using a negative lookahead, you might use a negated character class to match any character that is not in the character class.
If you only want to match word characters without the dD_ or a whitespace char you could use [^\W_Dd\s].
You might also remove the \W and \s to match all except dD_
^[0-9]*\.[^\W_Dd\s]*$
Explanation
^ Start of string
[0-9]*\. Match 0+ times a digit 0-9 followed by a dot
[^\W_Dd\s]* Negated character class, match 0+ times a word character without _ D d or whitespace char
$ End of string
Regex demo
If you don't want to use anchors to assert the start and the end of the string you could also use lookarounds to assert what is on the left and right is not a non whitspace char:
(?<!\S)[0-9]*\.[^\W_Dd\s]*(?!\S)
Regex demo
\d*[.](?!.*[_Dd]).* is what you are looking for:

Why is the special character not captured in the regex group

I have the following regular expression for capturing positive & negative time offsets.
\b(?<sign>[\-\+]?)(?<hours>2[1-3]|[01][0-9]|[1-9]):(?<minutes>[0-5]\d)\b
It matches fine but the leading sign doesn't appear in the capture group. Am I formatting it wrong?
You can see the effect here https://regex101.com/r/CQxL8q/1/
That is because of the first \b. The \b word boundary does not match between a start of the string/newline and a - or + (i.e. a non-word char).
You need to move the word boundary after the optional sign group:
(?<sign>[-+]?)\b(?<hours>2[1-3]|[01][0-9]|[1-9]):(?<minutes>[0-5][0-9])\b
^^
See the regex demo.
Now, since the char following the word boundary is a digit (a word char) the word boundary will work correctly failing all matches where the digit is preceded with another word char.
The word boundary anchor (\b) matches the transition between a word character (letter, digit or underscore) to a non-word character or vice-versa. There is no such transition in -13:21.
The word boundary anchor could stay between the sign and the hours to avoid matching it in expressions that looks similar to a time (65401:23) but you cannot prevent it match 654:01:23 or 654-01:23.
As a side note [\-\+] is just a convoluted way to write [-+]. + does not have any special meaning inside a character class, there is no need to escape it. - is a special character inside a character class but not when it is the first or the last character (i.e. [- or -]).
Another remark: you use both [0-9] and \d in your regex. They denote the same thing1 but, for readability, it's recommended to stick to only one convention. Since other character classes that contain only digits are used, I would use [0-9] and not \d.
And some bugs in the regex fragment for hours: 2[1-3]|[01][0-9]|[1-9] do not match 0 (but it matches 00) and 20.
Given all the above corrections and improvements, the regex should be:
(?<sign>[-+]?)\b(?<hours>2[0-3]|[01][0-9]|[0-9]):(?<minutes>[0-5][0-9])\b
1 \d is the same as [0-9] when the Unicode flag is not set. When Unicode is enabled, \d also matches the digits in non-Latin based alphabets.

Regular expression to match alphanumeric, hyphen, underscore and space string

I'm trying to match a string that contains alphanumeric, hyphen, underscore and space.
Hyphen, underscore, space and numbers are optional, but the first and last characters must be letters.
For example, these should all match:
abc
abc def
abc123
ab_cd
ab-cd
I tried this:
^[a-zA-Z0-9-_ ]+$
but it matches with space, underscore or hyphen at the start/end, but it should only allow in between.
Use a simple character class wrapped with letter chars:
^[a-zA-Z]([\w -]*[a-zA-Z])?$
This matches input that starts and ends with a letter, including just a single letter.
There is a bug in your regex: You have the hyphen in the middle of your characters, which makes it a character range. ie [9-_] means "every char between 9 and _ inclusive.
If you want a literal dash in a character class, put it first or last or escape it.
Also, prefer the use of \w "word character", which is all letters and numbers and the underscore in preference to [a-zA-Z0-9_] - it's easier to type and read.
Check this working in fiddle http://refiddle.com/refiddles/56a07cec75622d3ff7c10000
This will fix the issue
^[a-zA-Z]+[a-zA-Z0-9-_ ]*[a-zA-Z0-9]$
I tried using following regex:
/^\w+([\s-_]\w+)*$/
This allows alphanumeric, underscore, space and dash.
More details
As per your requirement of including space, hyphen, underscore and alphanumeric characters you can use \w shorthand character set for [a-zA-Z0-9_]. Escape the hyphen using \- as it usually used for character range inside character set.
To negate the space and hyphen at the beginning and end I have used [^\s\-].
So complete regex becomes [^\s\-][\w \-]+[^\s\-]
Here is the working demo.
You can use this regex:
^[a-zA-Z0-9]+(?:[\w -]*[a-zA-Z0-9]+)*$
RegEx Demo
This will only allow alphanumerics at start and end.

Regular expression with a set with a character followed by a character

I'm writing a regular expression in Java for capturing some word without spaces.
The word can contain only letter, number, hyphens and dot.
The character set [\w+\-\\.] work well.
Now I want to edit the set for allowing a single space after the dot.
How I have to edit my regular expression?
You can add an alternation that matches this additional requirement
([\w\-.]|(?<=\.) )+
See it here on Regexr
(?<=\.) is a lookbehind assertion. It ensures that space is only matched, if it is preceded by a dot.
Other hints:
\w contains the underscore and matches per default only ASCII letters/digits. If you care about Unicode, use either the modifier UNICODE_CHARACTER_CLASS to enable Unicode for \w or use the Unicode properties \p{L} and \p{Nd} to match Unicode letters and digits.
You don't need to escape the dot in a character class.
You have \w+ in your character class, are you aware, that you just add the "+" character to the accepted characters?
In case of a dot followed by a space, I suppose this pattern should be neither the first, nor the last in the matched string? You may want to enclose it in word boundaries \b:
([0-9A-Za-z-]|\b\.( \b)?)+
I deliberately did not use \w, to exclude underscores.
For allowing ONLY a single space after the dot you can use this regex:
^(?!.*?\. {2})[\w.-]+$
You don't need to escape dot OR hyphen inside character class
(?!.*?\. {2}) is a negative lookahead that disallows 2 or more spaces after a dot

What does this regular expression mean?

/\ATo\:\s+(.*)/
Also, how do you work it out, what's the approach?
In multi-line regular expressions, \A matches the start of the string (and \Z is end of string, while ^/$ matches the start/end of the string or the start/end of a line). In single line variants, you just use ^ and $ for start and end of string/line since there is no distinction.
To is literal, \: is an escaped :.
\s means whitespace and the + means one or more of the preceding "characters" (white space in this case).
() is a capturing group, meaning everything in here will be stored in a "register" that you can use. Hence, this is the meat that will be extracted.
.* simply means any non newline character ., zero or more times *.
So, what this regex will do is process a string like:
To: paxdiablo
Re: you are so cool!
and return the text paxdiablo.
As to how to learn how to work this out yourself, the Perl regex tutorial(a) is a good start, and then practise, practise, practise :-)
(a) You haven't actually stated which regex implementation you're using but most modern ones are very similar to Perl. If you can find a specific tutorial for your particular flavour, that would obviously be better.
\A is a zero-width assertion and means "Match only at beginning of string".
The regex reads: On a line beginning with "To:" followed by one or more whitespaces (\s), capture the remainder of the line ((.*)).
First, you need to know what the different character classes and quantifiers are. Character classes are the backslash-prefixed characters, \A from your regex, for instance. Quantifiers are for instance the +. There are several references on the internet, for instance this one.
Using that, we can see what happens by going left to right:
\A matches a beginning of the string.
To matches the text "To" literally
\: escapes the ":", so it loses it's special meaning and becomes "just a colon"
\s matches whitespace (space, tab, etc)
+ means to match the previous class one or more times, so \s+ means one or more spaces
() is a capture group, anything matched within the parens is saved for later use
. means "any character"
* is like the +, but zero or more times, so .* means any number of any characters
Taking that together, the regex will match a string beginning with "To:", then at least one space, and the anything, which it will save. So, with the string "To: JaneKealum", you'll be able to extract "JaneKealum".
You start from left and look for any escaped (ie \A) characters. The rest are normal characters. \A means the start of the input. So To: must be matched at the very beginning of the input. I think the : is escaped for nothing. \s is a character group for all spaces (tabs, spaces, possibly newlines) and the + that follows it means you must have one or more space characters. After that you capture all the rest of the line in a group (marked with ( )).
If the input was
To: progo#home
the capture group would contain "progo#home"
It matches To: at the beginning of the input, followed by at least one whitespace, followed by any number of characters as a group.
The initial and trailing / characters delimit the regular expression.
A \ inside the expression means to treat the following character specially or treat it as a literal if it normally has a special meaning.
The \A means match only at the beginning of a string.
To means match the literal "To"
\: means match a literal ':'. A colon is normally a literal and has no special meaning it can be given.
\s means match a whitespace character.
+ means match as many as possible but at least one of whatever it follows, so \s+ means match one or more whitespace characters.
The ( and ) define a group of characters that will be captured and returned by the expression evaluator.
And finally the . matches any character and the * means match as many as possible but can be zero. Therefore the (.*) will capture all characters to the end of the input string.
So therefore the pattern will match a string that starts "To:" and capture all characters that occur after the first succeeding non-whitespace character.
The only way to really understand these things is to go through them one bit at a time and check the meaning of each component.