Does regular expression \d match minus sign and/or decimal point? - regex

I'm looking at some old Perl/CGI code to debug an issue and noticed a lot of uses of:
\d - Match digit character
\D - Match non-digit character
Most online docs mention that \d is the same as [0-9], which is what I've always thought of it as. But I've also noticed Stack Overflow questions that mention character set differences.
Does "\d" in regex mean a digit?
Does \d also match a minus sign and/or decimal point?
I'm off to do some testing.

Does \d also match a minus sign and/or decimal point?
NO

I don't know how Perl determines whether to use Unicode, ASCII, or locale semantics by default (no flag, no use). Regardless, by declaring use re '/a'; (ASCII), use re '/u'; (Unicode), or use re '/l'; (locale), you clearly signal to the Perl interpreter (and the human reader) which mode you want, and you avoid unexpected behaviour.
Due to the effect of modifiers, \d has at least 2 meanings:
Under effect of /a flag (ASCII), \d will match digits from 0 to 9 (no more and no less).
Under effect of /u flag (Unicode), \d will match any decimal digit in any language, and is equivalent to \p{Digit}. This effectively makes \d+ pretty useless and dangerous to use, since it allows a mix of digits from any language.
Quote from description of /u flag
And, \d+ , may match strings of digits that are a mixture from different writing systems, creating a security issue. num() in Unicode::UCD can be used to sort this out. Or the /a modifier can be used to force \d to match just the ASCII 0 through 9.
\d will not match any sign or punctuation, since those characters do not belong to the Nd (Number, decimal digit) General Category of Unicode.
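The same distinction can be tried outside Perl. What follows is a minimal sketch in Python, not the asker's Perl code: Python's re module treats \d as "any Unicode decimal digit" for str patterns by default, and its re.ASCII flag plays a role similar to Perl's /a. The helper name is_all_digits and the sample strings are made up for illustration.

```python
import re

ascii_digits = "2024"
arabic_indic = "\u0661\u0662\u0663"  # ARABIC-INDIC DIGIT ONE, TWO, THREE (category Nd)

# Default for str patterns: \d matches any character in Unicode category Nd.
unicode_pat = re.compile(r"\d+")
# re.ASCII restricts \d to [0-9], comparable to Perl's /a modifier.
ascii_pat = re.compile(r"\d+", re.ASCII)


def is_all_digits(pattern, text):
    """Return True when the whole string is matched by the digit pattern."""
    return pattern.fullmatch(text) is not None


for label, pat in (("unicode", unicode_pat), ("ascii", ascii_pat)):
    for text in (ascii_digits, arabic_indic, "-1.5"):
        print(label, repr(text), is_all_digits(pat, text))
```

In both modes the sign and the decimal point are rejected; only the set of characters counted as digits changes.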

The answer is no. It merely does a digit check. However, Unicode makes things a bit more complex.
If you want to make sure something is a number -- a decimal number -- take a look at the Scalar::Util module. One of the functions it has is looks_like_number. This can be used to see if the string you're looking at could be a number or not, and works better than trying to use a regular expression.
This module has been part of standard Perl for a while, so you should have it on your system.

Related

Activate "char_classes" in boost regex library [duplicate]

How do I create a regular expression that detects hexadecimal numbers in a text?
For example, ‘0x0f4’, ‘0acdadecf822eeff32aca5830e438cb54aa722e3’, and ‘8BADF00D’.
How about the following?
0[xX][0-9a-fA-F]+
Matches an expression starting with a 0, followed by either a lowercase or uppercase x, followed by one or more characters in the ranges 0-9, a-f, or A-F.
The exact syntax depends on your exact requirements and programming language, but basically:
/[0-9a-fA-F]+/
or more simply, i makes it case-insensitive.
/[0-9a-f]+/i
If you are lucky enough to be using Ruby, you can do:
/\h+/
EDIT - Steven Schroeder's answer made me realise my understanding of the 0x bit was wrong, so I've updated my suggestions accordingly.
If you also want to match 0x, the equivalents are
/0[xX][0-9a-fA-F]+/
/0x[0-9a-f]+/i
/0x[\h]+/i
ADDED MORE - If 0x needs to be optional (as the question implies):
/(0x)?[0-9a-f]+/i
Not a big deal, but most regex engines support the POSIX character classes, and there's [:xdigit:] for matching hex characters, which is simpler than the common 0-9a-fA-F stuff.
So, the regex as requested (ie. with optional 0x) is: /(0x)?[[:xdigit:]]+/
It's worth mentioning that detecting an MD5 (which is one of the examples) can be done with:
[0-9a-fA-F]{32}
This will match with or without 0x prefix
(?:0[xX])?[0-9a-fA-F]+
If you're using Perl or PHP, you can replace
[0-9a-fA-F]
with:
[[:xdigit:]]
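To check the optional-prefix pattern against the three examples from the question, here is a small Python sketch. Python's re module has no POSIX bracket classes such as [[:xdigit:]], so the explicit ranges are used; the names HEX_OPTIONAL_PREFIX and is_hex are only illustrative.

```python
import re

# Optional 0x/0X prefix followed by one or more hex digits.
HEX_OPTIONAL_PREFIX = re.compile(r"(?:0[xX])?[0-9a-fA-F]+")


def is_hex(text):
    """Return True when the whole string is a hex number, with or without 0x."""
    return HEX_OPTIONAL_PREFIX.fullmatch(text) is not None


samples = [
    "0x0f4",
    "0acdadecf822eeff32aca5830e438cb54aa722e3",
    "8BADF00D",
    "0x",   # prefix only, no digits after it
    "xyz",  # no hex digits at all
]
for sample in samples:
    print(sample, is_hex(sample))
```

fullmatch is used so that the whole string has to be hex; to find hex numbers inside longer text you would use finditer instead and probably add \b on both sides.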
Just for the record I would specify the following:
/^[xX]?[0-9a-fA-F]{6}$/
This differs in that it requires exactly six valid hex characters, optionally preceded by a lowercase or uppercase x.
Another example: Hexadecimal values for css colors start with a pound sign, or hash (#), then six characters that can either be a numeral or a letter between A and F, inclusive.
^#[0-9a-fA-F]{6}
If you are looking for a specific hex character in the middle of the string, you can use "\xhh" where hh is the character code in hexadecimal. I've tried it and it works. I use the Qt framework for C++, but it can solve problems in other cases too, depending on the flavor you need to use (PHP, JavaScript, Python, Go, etc.).
This answer was taken from: http://ult-tex.net/info/perl/
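This one matches exactly three valid pairs (six hex characters):
(([a-fA-F]|[0-9]){2}){3}
Fewer than three pairs fail to match. To reject longer input as well, the pattern has to be anchored (for example with ^ and $), because an unanchored search still finds six valid characters inside a longer string.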
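Whether "more than three pairs" is rejected depends on anchoring, which is easy to check. A quick Python sketch, using the pattern exactly as posted (the variable name THREE_PAIRS is just for illustration):

```python
import re

THREE_PAIRS = re.compile(r"(([a-fA-F]|[0-9]){2}){3}")

# Unanchored search: finds the first six hex characters inside a longer string.
found = THREE_PAIRS.search("aabbccdd")
print(found.group(0) if found else None)

# fullmatch behaves like wrapping the pattern in ^...$ and enforces the length.
for text in ("aabb", "aabbcc", "aabbccdd"):
    print(text, THREE_PAIRS.fullmatch(text) is not None)
```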
In Java this is allowed:
(?:0x?)?[\p{XDigit}]+$
As you see the 0x is optional (even the x is optional) in a non-capturing group.
In case you need this within an input where the user can type 0 and 0x too but not a hex number without the 0x prefix:
^0?[xX]?[0-9a-fA-F]*$
first, instead of ^ and $ use \b as this is a word delimiter and can help when the hash is not the only string in the line.
i came here looking for similar but specialized regex and came up with this:
\b(\d+[a-f]+\d+[\da-f]*|[a-f]+\d+[a-f]+[\da-f]*)\b
I needed to detect hashes like git commit identifiers (and similar) in console output, and more than matching all possible hashes i prioritize NOT matching random words or numbers like EB or 12345678
So a heuristic approach i made is that I assume a hash will be alternating between numbers and letters reasonably often and the chains of only numbers or only letters will be short.
Another important fact is that an MD5 hash is 32 characters long (as mentioned by @Adaddinsane) and git displays a shortened version with only 10 characters, so the above example can be modified as follows:
for 10-char long hashes i assume the groups will be at most 3-char long
\b(\d+[a-f]+\d+[\da-f]{1,7}|[a-f]+\d+[a-f]+[\da-f]{1,7})\b
for up to 32-char long hashes i assume the groups will be at most 5-char long
\b(\d+[a-f]+\d+[\da-f]{17,29}|[a-f]+\d+[a-f]+[\da-f]{17,29})\b
you can easily change a-f to a-fA-F for case insensitivity or add 0[xX] at the front for that 0x prefix matching
those examples will obviously not match exotic but valid hashes that have very long sequences of only numbers or only letters in the front or extreme hashes like only 0s
but this way i can match hashes and reduce accident false-positive matches significantly, like dir name or line number
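A rough way to try the heuristic on a line of console output, sketched in Python. The function name find_hash_like and the sample line are made up; re.ASCII is added so that \d and \b only consider ASCII characters, which is an assumption on top of the original pattern.

```python
import re

# The unbounded variant from above: digits and letters must alternate at the start.
HASH_LIKE = re.compile(
    r"\b(\d+[a-f]+\d+[\da-f]*|[a-f]+\d+[a-f]+[\da-f]*)\b",
    re.ASCII,
)


def find_hash_like(line):
    """Return every substring of the line that looks like a lowercase hex hash."""
    return [m.group(0) for m in HASH_LIKE.finditer(line)]


line = "commit 3f2a9c1 fixed EB 12345678 deadbeef"
print(find_hash_like(line))
```

As the answer says, all-letter or all-digit candidates such as deadbeef or 12345678 are skipped on purpose, so this trades missed exotic hashes for fewer false positives.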

re compile error: sre_constants.error: bad character range [duplicate]

How to rewrite the [a-zA-Z0-9!$* \t\r\n] pattern to match hyphen along with the existing characters ?
The hyphen is usually a normal character in regular expressions. Only if it’s in a character class and between two other characters does it take a special meaning.
Thus:
[-] matches a hyphen.
[abc-] matches a, b, c or a hyphen.
[-abc] matches a, b, c or a hyphen.
[ab-d] matches a, b, c or d (only here the hyphen denotes a character range).
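These placement rules can be checked quickly. A minimal sketch in Python (the question itself is about another engine, but Python's re follows the same rules for these cases); the variable names are only illustrative.

```python
import re

# Hyphen placed last in the class: treated as a literal hyphen.
hyphen_last = re.compile(r"[a-zA-Z0-9!$* \t\r\n-]+")
# Escaped hyphen: literal regardless of position.
hyphen_escaped = re.compile(r"[a-zA-Z0-9!$* \t\r\n\-]+")
# Hyphen between two characters: a range from b to d, not a literal hyphen.
range_only = re.compile(r"[ab-d]")

text = "a-b c!"
print(hyphen_last.fullmatch(text) is not None)
print(hyphen_escaped.fullmatch(text) is not None)
print(range_only.fullmatch("c") is not None)
print(range_only.fullmatch("-") is not None)
```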
Escape the hyphen.
[a-zA-Z0-9!$* \t\r\n\-]
UPDATE:
Never mind this answer - you can add the hyphen to the group but you don't have to escape it. See Konrad Rudolph's answer instead which does a much better job of answering and explains why.
It’s less confusing to always use an escaped hyphen, so that it doesn't have to be positionally dependent. That’s a \- inside the bracketed character class.
But there’s something else to consider. Some of those enumerated characters should possibly be written differently. In some circumstances, they definitely should.
This comparison of regex flavors says that C♯ can use some of the simpler Unicode properties. If you’re dealing with Unicode, you should probably use the general category \p{L} for all possible letters, and maybe \p{Nd} for decimal numbers. Also, if you want to accommodate all that dash punctuation, not just HYPHEN-MINUS, you should use the \p{Pd} property. You might also want to write that sequence of whitespace characters simply as \s, assuming that’s not too general for you.
All together, that works out to a pattern of [\p{L}\p{Nd}\p{Pd}!$*] to match any one character from that set.
I’d likely use that anyway, even if I didn’t plan on dealing with the full Unicode set, because it’s a good habit to get into, and because these things often grow beyond their original parameters. Now when you lift it to use in other code, it will still work correctly. If you hard‐code all the characters, it won’t.
[-a-z0-9]+, [a-z0-9-]+ and [a-z-0-9]+ are all the same. A hyphen between two ranges is treated as a literal symbol. The regex [a-z0-9-+()]+ also allows a hyphen.
use "\p{Pd}" without quotes to match any type of hyphen. The '-' character is just one type of hyphen which also happens to be a special character in Regex.
Is this what you are after?
MatchCollection matches = Regex.Matches(mystring, "-");

How to include special chars in this regex

First of all I am a total noob to regular expressions, so this may be optimized further, and if so, please tell me what to do. Anyway, after reading several articles about regex, I wrote a little regex for my password matching needs:
(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(^[A-Z]+[a-z0-9]).{8,20}
What I am trying to do is: it must start with an uppercase letter, must contain a lowercase letter, must contain at least one number, must contain at least one special character, and must be between 8 and 20 characters in length.
The above somehow works, but it doesn't force special chars (. seems to match any character but I don't know how to use it with the positive lookahead) and the min length seems to be 10 instead of 8. What am I doing wrong?
PS: I am using http://gskinner.com/RegExr/ to test this.
Let's strip away the assertions and just look at your base pattern alone:
(^[A-Z]+[a-z0-9]).{8,20}
This will match one or more uppercase Latin letters, followed by a single lowercase Latin letter or decimal digit, followed by 8 to 20 of any character. So yes, at minimum this will require 10 characters, but there's no maximum number of characters it will match (e.g. it will allow 100 uppercase letters at the start of the string). Furthermore, since there's no end anchor ($), this pattern would allow any trailing characters after the matched substring.
I'd recommend a pattern like this:
^(?=.*[a-z])(?=.*[0-9])(?=.*[!@#$])[A-Z]+[A-Za-z0-9!@#$]{7,19}$
Where !@#$ is a placeholder for whatever special characters you want to allow. Don't forget to escape special characters if necessary (\, ], ^ at the beginning of the character class, and - in the middle).
Using POSIX character classes, it might look like this:
^(?=.*[[:lower:]])(?=.*[[:digit:]])(?=.*[[:punct:]])[[:upper:]]+[[:alnum:][:punct:]]{7,19}$
Or using Unicode character classes, it might look like this:
^(?=.*[\p{Ll}])(?=.*\d)(?=.*[\p{P}\p{S}])[\p{Lu}]+[\p{L}\d\p{P}\p{S}]{7,19}$
Note: each of these considers a different set of 'special characters', so they aren't identical to the first pattern.
The following should work:
^(?=.*[a-z])(?=.*[0-9])(?=.*[^a-zA-Z0-9])[A-Z].{7,19}$
I removed the (?=.*[A-Z]) because the requirement that you must start with an uppercase character already covers that. I added (?=.*[^a-zA-Z0-9]) for the special characters, this will only match if there is at least one character that is not a letter or a digit. I also tweaked the length checking a little bit, the first step here was to remove the + after the [A-Z] so that we know exactly one character has been matched so far, and then changing the .{8,20} to .{7,19} (we can only match between 7 and 19 more characters if we already matched 1).
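Here is a quick way to check the length arithmetic described above, sketched in Python with the pattern copied as is. The name PASSWORD and the sample passwords are made up for the test.

```python
import re

PASSWORD = re.compile(
    r"^(?=.*[a-z])(?=.*[0-9])(?=.*[^a-zA-Z0-9])[A-Z].{7,19}$"
)

candidates = [
    "Passw0rd!",          # 9 chars, meets every rule
    "passw0rd!",          # does not start with an uppercase letter
    "Passw0rd",           # no special character
    "Pa1!xyz",            # 7 chars, one short of the minimum
    "Pa1!xyzw",           # 8 chars, the minimum
    "Pa1!" + "x" * 16,    # 20 chars, the maximum
    "Pa1!" + "x" * 17,    # 21 chars, one over the maximum
]
for candidate in candidates:
    print(len(candidate), candidate, PASSWORD.match(candidate) is not None)
```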
Well, here is how I would write it, if I had such requirements - excepting situations where it's absolutely not possible or practical, I prefer to break up complex regular expressions. Note that this is English-specific, so a Unicode or POSIX character class (where supported) may make more sense:
/^[A-Z]/ && /[a-z]/ && /[0-9]/ && /[whatever special]/ && ofCorrectLength(x)
That is, I would avoid trying to incorporate all the rules at once.
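The same idea of one small check per rule could look roughly like this in Python. It is only a sketch: password_ok and its specials parameter are hypothetical names, and the default "!@#$" is a stand-in for whatever special characters you decide to allow.

```python
import re


def password_ok(candidate, specials="!@#$"):
    """Apply each password rule as its own small check (specials is a placeholder set)."""
    checks = [
        re.match(r"[A-Z]", candidate),                          # starts with uppercase
        re.search(r"[a-z]", candidate),                         # has a lowercase letter
        re.search(r"[0-9]", candidate),                         # has a digit
        re.search("[" + re.escape(specials) + "]", candidate),  # has a special char
        8 <= len(candidate) <= 20,                              # length rule
    ]
    return all(checks)


for candidate in ("Passw0rd!", "passw0rd!", "Passw0rd", "Pa1!"):
    print(candidate, password_ok(candidate))
```

Each rule can then be reported or changed on its own, which is the main reason to split the regex up.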

URL Regular Expression matching exact 3 characters after decimal

I require a regular expression to match exactly 3 or 2 characters after the decimal point, so that it validates www.xyz.com and not xyz.Complete
I think what you want is \b
I can't think of a case that's not reasonably covered by the word-boundary assertion \b; any of the other answers need only have \b at the end. If it's always .com, you'd use \.com\b, which means a literal dot (.) character followed by com, where whatever follows is something other than a letter, number or underscore. It's a zero-width assertion, which means it will not consume any characters. To allow a .net or .edu as well, you would use \.(com|edu|net)\b
The \b assertion is supported in most tools and languages using regexes, but if you need to get more precise (for instance, you might want to allow an underscore after com), your tool or language compiler may support "lookaheads" which are also zero-width assertions. (in the instance mentioned just above, you would use something like \.(com|net|edu|org|mil|museum)(?![a-zA-Z0-9]) which would prohibit numbers and uppercase or lowercase letters)
Strictly answering your question of
match exactly 3 or 2 characters after decimal point
To match just the ending:
\.[A-Za-z]{2,3}$
the \ escapes the . which otherwise means "any character"
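Both the end-anchored pattern and the \b variant can be checked against the two strings from the question. A minimal Python sketch; the names ENDS_WITH_TLD and KNOWN_TLD are only illustrative.

```python
import re

# Literal dot, then 2 or 3 letters, then end of string.
ENDS_WITH_TLD = re.compile(r"\.[A-Za-z]{2,3}$")
# Literal dot, one of a fixed list of endings, then a word boundary.
KNOWN_TLD = re.compile(r"\.(?:com|edu|net)\b")

for text in ("www.xyz.com", "xyz.Complete", "xyz.complete", "www.xyz.com/path"):
    print(
        text,
        ENDS_WITH_TLD.search(text) is not None,
        KNOWN_TLD.search(text) is not None,
    )
```

The \b version also accepts a host followed by a path, since it does not require the ending to be at the end of the string.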
You forgot the string beginning and ending checks (^, $). Use this:
^[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)$

How to write this regular expression in Lua?

I'm new to the Lua regex equivalence features, I need to write the following regular expression, which should match numbers with decimals
\b[0-9]*.\b[0-9]*(?!])
Basically, it matches numbers in decimal format (eg: 1, 1.1, 0.1, 0.11), which do not end with ']', I've been trying to write a regex like this with Lua using string.gmatch, but I'm quite inexperienced with Lua matching expressions...
Thanks!
Lua does not have regular expressions, mainly because a full regular expression library would be bigger than Lua itself.
What Lua has instead are matching patterns, which are way less powerful (but still sufficient for many use cases):
There is no "word boundary" matcher,
no alternatives,
and also no lookahead or similar.
I think there is no Lua pattern which would match every possible occurrence of your string, and no other one, which means that you somehow must work around this.
The pattern proposed by Stuart, %d*%.?%d*, matches all decimal numbers (with or without a dot), but it also matches the empty string, which is not very useful. %d+%.?%d* matches all decimal numbers with at least one digit before the dot (or without a dot), %d*%.?%d+ matches all decimal numbers with at least one digit after the dot (or without a dot). %.%d+ matches decimal numbers without a digit before the dot.
A simple solution would be to search more than one of these patterns (for example, both %d+%.?%d* and %.%d+), and combine the results. Then look at the places where you found them and look if there is a ']' following them.
I experimented a bit with the frontier pattern.
The pattern %f[%.%d]%d*%.?%d*%f[^%.%d%]] matches all decimal numbers which are preceded by something that is neither digit nor dot (or by nothing), and followed by something that is neither ] nor digit nor dot (or by nothing). It also matches the single dot, though.
"%d*%.?%d+" will match all such numbers in decimal format (note that that's going to miss any signed numbers such as -1.1 or +3.14). You'll need to come up with another solution to avoid instances that end with ], such as removing them from the string before looking for the numbers:
local pattern = "%d*%.?%d+"
local clean = string.gsub(orig, pattern .. "%]", "")
return string.gmatch(clean, pattern)