I'm not exactly a pro when it comes to regex and I have a PHP script that runs things through this regex:
^[\d\D]{1,}$
What is this supposed to do? It seems that it matches everything.
\d matches any digit
\D matches any non-digit.
[\d\D] matches all digits and non-digits.
{1,} asks for the match in [] to be repeated at least 1 time (with no upper limit).
So it matches everything with at least 1 character in it.
Reference: http://www.regular-expressions.info/reference.html
In short all that regex is doing is this:
^.+$
Which means: match one or more of any character, digit or non-digit (although, unlike [\d\D], the . does not match newlines unless singleline mode is on).
^[\d\D]{1,}$ will match a string which contains one or more {1,} of any digit \d or non-digit \D character including newline characters.
In contrast, ^.+$ will match a string containing one or more of any character except newlines. If the singleline modifier were added to the regex, i.e. /^.+$/s, then the . would match any character, including newlines.
[\d\D] is equivalent to using . in singleline mode, although more commonly [\s\S] is used with the same result.
+ is equivalent to {1,}.
The regex will match the whole of any string that contains at least one character of any kind.
You are right: it matches anything that is at least one character long, but in an overcomplicated and pointless way. [\d\D] is equivalent to . (in singleline mode, where . also matches newlines) and {1,} is equivalent to +.
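The newline difference described in the answers above is easy to check. The following is a minimal sketch in Python rather than PHP, on the assumption that Python's re module behaves like PCRE for these particular patterns; re.DOTALL stands in for the /s modifier, and the variable names are made up for illustration.

```python
import re

# The pattern from the question and the two shorter forms from the answers.
digit_or_not = re.compile(r'^[\d\D]{1,}$')
dot_plus = re.compile(r'^.+$')
dot_plus_singleline = re.compile(r'^.+$', re.DOTALL)  # counterpart of /^.+$/s

multiline_text = "line one\nline two"

# [\d\D] matches newlines, so the whole two-line string is accepted.
print(bool(digit_or_not.match(multiline_text)))
# Without DOTALL the dot stops at the newline, so the anchored match fails.
print(bool(dot_plus.match(multiline_text)))
# With DOTALL the dot behaves like [\d\D].
print(bool(dot_plus_singleline.match(multiline_text)))
# {1,} requires at least one character, so the empty string is rejected.
print(bool(digit_or_not.match("")))
```

For single-line input the three patterns accept exactly the same strings, which is why the shorter ^.+$ is usually good enough.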
I have the following regular expression for capturing positive & negative time offsets.
\b(?<sign>[\-\+]?)(?<hours>2[1-3]|[01][0-9]|[1-9]):(?<minutes>[0-5]\d)\b
It matches fine but the leading sign doesn't appear in the capture group. Am I formatting it wrong?
You can see the effect here https://regex101.com/r/CQxL8q/1/
That is because of the first \b. The \b word boundary does not match between the start of the string (or of a line) and a - or + (i.e. a non-word char).
You need to move the word boundary after the optional sign group:
(?<sign>[-+]?)\b(?<hours>2[1-3]|[01][0-9]|[1-9]):(?<minutes>[0-5][0-9])\b
              ^^
See the regex demo.
Now, since the char following the word boundary is a digit (a word char), the word boundary works correctly, failing all matches where the digit is preceded by another word char.
The word boundary anchor (\b) matches the transition between a word character (letter, digit or underscore) and a non-word character, or vice versa. There is no such transition before the sign in -13:21.
The word boundary anchor can stay between the sign and the hours to avoid matching inside expressions that look similar to a time (65401:23), but you cannot prevent it from matching in 654:01:23 or 654-01:23.
As a side note [\-\+] is just a convoluted way to write [-+]. + does not have any special meaning inside a character class, there is no need to escape it. - is a special character inside a character class but not when it is the first or the last character (i.e. [- or -]).
Another remark: you use both [0-9] and \d in your regex. They denote the same thing [1] but, for readability, it's recommended to stick to one convention. Since other character classes that contain only digits are used, I would use [0-9] and not \d.
There are also some bugs in the regex fragment for hours: 2[1-3]|[01][0-9]|[1-9] does not match 0 (although it matches 00), and it does not match 20.
Given all the above corrections and improvements, the regex should be:
(?<sign>[-+]?)\b(?<hours>2[0-3]|[01][0-9]|[0-9]):(?<minutes>[0-5][0-9])\b
[1] \d is the same as [0-9] when the Unicode flag is not set. When Unicode is enabled, \d also matches the digits of non-Latin scripts.
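To see the effect of moving the word boundary, here is a small sketch in Python. One adaptation is needed: Python's re module spells named groups (?P<name>...) instead of (?<name>...); otherwise the patterns are the ones from the question and from the final answer. The names ORIGINAL and FIXED are illustrative only.

```python
import re

# Pattern from the question: \b sits before the optional sign.
ORIGINAL = re.compile(
    r'\b(?P<sign>[-+]?)(?P<hours>2[1-3]|[01][0-9]|[1-9]):(?P<minutes>[0-5]\d)\b'
)
# Corrected pattern: \b moved after the sign, hours fixed to accept 0 and 20.
FIXED = re.compile(
    r'(?P<sign>[-+]?)\b(?P<hours>2[0-3]|[01][0-9]|[0-9]):(?P<minutes>[0-5][0-9])\b'
)

# The original still finds the time, but the match starts after the "-",
# so the sign group is empty.
print(repr(ORIGINAL.search("-13:21").group("sign")))

# The corrected pattern captures the sign.
print(FIXED.search("-13:21").groupdict())

# Hours 20 and a single 0 are rejected by the original, accepted by the fix.
print(ORIGINAL.search("20:05"), FIXED.search("20:05").group("hours"))
print(FIXED.search("0:30").group("hours"))

# The word boundary still blocks a match inside a longer number.
print(FIXED.search("65401:23"))
```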
I just want to match every character up to, but not including, the last period:
dog.jpg -> dog
abc123.jpg.jpg -> abc123.jpg
I have tried
(.+?)\.[^\.]+$
Use lookahead to assert the last dot character:
.*(?=\.)
Live demo.
This will do the trick
(.*)\.
Regex Demo
The first captured group contains the name. You can access it as $1 or \1, depending on your language.
Regex quantifiers are greedy by default. This means that when a pattern is capable of matching more characters, it will match more characters.
This is a good thing, in your case. All you need to do is match characters and then a dot:
.*\.
That is,
. # Match "any" character
* # Do the previous thing (.) zero OR MORE times (any number of times)
\ # Escape the next character - treat it as a plain old character
. # Escaped, just means "a dot".
So: being greedy by default, match any character AS MANY TIMES AS YOU CAN (because greedy) and then a literal dot.
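Both suggestions (the capturing group and the lookahead) can be tried side by side. This is a minimal Python sketch; the two function names are hypothetical, and returning the input unchanged when it contains no period is an assumption of this example, not something the answers specify.

```python
import re

def strip_last_extension_capture(filename):
    # Greedy .* runs to the end, then backtracks to the LAST literal dot.
    match = re.match(r'(.*)\.', filename)
    # Assumption: a name without any dot is returned as-is.
    return match.group(1) if match else filename

def strip_last_extension_lookahead(filename):
    # Same idea, but the dot is only asserted, so the whole match is the name.
    match = re.match(r'.*(?=\.)', filename)
    return match.group(0) if match else filename

for name in ["dog.jpg", "abc123.jpg.jpg"]:
    print(name, "->", strip_last_extension_capture(name),
          "/", strip_last_extension_lookahead(name))
```

With the lookahead version there is no group to extract, which is convenient when the regex is used for a plain match rather than a capture.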
I noticed some interesting behaviour with some regex work I am doing, and I'd like some insight.
From what I understand, the word character, \w should match the following [a-zA-Z_0-9]
Given this input,
0000000060399301+0000000042456971+0000000
What should this regex
(\d+)\w
Capture?
I would expect it to capture 0000000060399301 but it actually captures 000000006039930
Is there something I am missing? Why is the 1 dropped from the end?
I noticed if I changed the regex to
(\d+\w)
It captures correctly i.e. including the 1
Anyone care to explain? Thanks
You require the regex to match a trailing word character - that would be the 1.
It cannot be another character, because
+ is not a word class character
+ is not a digit
matching is greedy
\d+ - matches one or more digit characters.
\w - matches exactly one word character, [A-Za-z\d_].
So with the string 0000000060399301+, the \d+ in (\d+)\w first matches all the digits, including the 1 before the +. The next part of the pattern is \w, but the next character is +, which is not a word character. The regex engine therefore backtracks one character to the left, so that \w can match the last digit before the +. Now the captured group contains 000000006039930 and the final 1 is matched by \w.
The 1 is being dropped because \w isn't in the capture group.
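The backtracking is visible if you compare the whole match with the captured group. A short Python sketch using the input from the question (the variable names are illustrative):

```python
import re

text = "0000000060399301+0000000042456971+0000000"

outside = re.match(r'(\d+)\w', text)   # \w is outside the capture group
inside = re.match(r'(\d+\w)', text)    # \w is inside the capture group

# The overall match is the same 16 characters in both cases...
print(outside.group(0))
print(inside.group(0))

# ...but the group only holds what was matched inside the parentheses,
# so the digit that \d+ gave back to \w is missing from the first one.
print(outside.group(1))
print(inside.group(1))
```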
I am using a regular expression to validate a pattern followed by a fraction. I found these and they match what I need. Overall I want to match 1 to 2 digits followed by the fraction. How are these expressions different?
/^[0-9]+(?:[\xbc\xbd\xbe])$/ugm
/^\d+(?:[\xbc\xbd\xbe])$/ugm
/^\w+(?:\w+)$/ugm
I need to match the following:
12½
1¼
11¾
but not match:
111½
11111¼
0¾
Well, to begin with, [0-9] matches only the characters 0 to 9, which is not necessarily the same as \d.
\d matches the digits 0-9 and, when Unicode matching is enabled, digit characters from other scripts as well.
\w matches any word character (letter, number, or underscore)
Although the given expressions may all match your examples, your 3rd solution will eventually fail you.
It also matches a string like foobar, which contains neither digits (0-9) nor Unicode fractions.
In a quick benchmark, your 2nd solution was about 16% slower than your first, and it also matches Unicode and other digit characters.
I would stick with your first expression, and change it to match between 1-2 number characters.
/^[1-9][0-9]?(?:[\xbc\xbd\xbe])$/ugm
or even
/^[1-9][0-9]?(?:[\xbc-\xbe])$/ugm
Try the following:
^[1-9][0-9]?[\xbc\xbd\xbe]$
[0-9] and \d are equivalent when Unicode matching is not enabled. \w matches a "word" character. The expression [1-9] matches a digit which is not zero (since you specifically asked how to exclude that).
This unattractively hard-codes for some legacy 8-bit character set; for future compatibility, you should consider switching to Unicode.
You can try
/^[1-9][0-9]?(?:[\xbc\xbd\xbe])$/ugm
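The suggested pattern can be checked against the examples from the question. The sketch below is in Python and makes a few assumptions: re.fullmatch replaces the ^...$ anchors, the fractions are written as Unicode escapes (U+00BC to U+00BE) rather than bytes of an 8-bit character set, and Python's str patterns are Unicode-aware by default, which roughly corresponds to the u flag. All variable names are made up for illustration.

```python
import re

# Fractions as escapes: U+00BC (1/4), U+00BD (1/2), U+00BE (3/4).
QUARTER, HALF, THREE_QUARTERS = "\u00bc", "\u00bd", "\u00be"

# One non-zero digit, an optional second digit, then one fraction character.
fraction_pattern = re.compile(r'[1-9][0-9]?[\xbc-\xbe]')

should_match = ["12" + HALF, "1" + QUARTER, "11" + THREE_QUARTERS]
should_not_match = ["111" + HALF, "11111" + QUARTER, "0" + THREE_QUARTERS]

accepted = [s for s in should_match if fraction_pattern.fullmatch(s)]
rejected = [s for s in should_not_match if not fraction_pattern.fullmatch(s)]
print(accepted == should_match, rejected == should_not_match)

# The third pattern from the question accepts plain words as well.
word_pattern_matches_foobar = re.fullmatch(r'\w+(?:\w+)', "foobar") is not None
print(word_pattern_matches_foobar)

# In Python, \d on a str pattern matches non-ASCII digits unless re.ASCII is set.
arabic_indic_three = "\u0663"
d_matches_unicode_digit = re.fullmatch(r'\d', arabic_indic_three) is not None
range_matches_unicode_digit = re.fullmatch(r'[0-9]', arabic_indic_three) is not None
print(d_matches_unicode_digit, range_matches_unicode_digit)
```

The last two checks illustrate why the answers disagree about \d and [0-9]: whether they are equivalent depends on whether Unicode matching is active in the engine you use.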
The Regular expression
/[\D\S]/
should match characters that are not a digit or not whitespace.
But when I test this expression in regexpal,
it matches every character, including digits and whitespace.
What am I doing wrong?
\D = all characters except digits,
\S = all characters except whitespaces
[\D\S] = union (set theory) of the above character groups = all characters.
Why? Because \D contains \s and \S contains \d.
If you want to match characters which are neither digits nor whitespace, you can use [^\d\s].
Your regex cancels itself out. Putting items inside [] means a character only has to match one of them. Here the two items cover each other's gaps, so together they match everything: \D matches every character that is not a digit (including whitespace), and \S matches every character that is not whitespace (including digits).
You can try [^\d\s], which negates the match of any digit or any whitespace character. Instead of catching everything like the original regex, it excludes both \d and \s. You can see testing done with it here.
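The difference between the union [\D\S] and the negated class [^\d\s] shows up on a short sample string. A minimal Python sketch (the sample string and variable names are arbitrary):

```python
import re

sample = "a1 _-"

# Union of "not a digit" and "not whitespace": every character qualifies,
# because a digit is not whitespace and whitespace is not a digit.
union_hits = re.findall(r'[\D\S]', sample)
print(union_hits)

# Negated class: only characters that are neither digits nor whitespace.
negated_hits = re.findall(r'[^\d\s]', sample)
print(negated_hits)
```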