Regex - Why don't these two expressions produce the same result? [duplicate] - regex

I'm currently using this website to create some regular expressions for a programming language I want to build. At the moment I'm just setting up an expression for identifiers.
In my language, identifiers are expressed as in most languages:
They cannot begin with a digit or a special character other than an underscore
After the first character they can contain alphanumeric and underscore characters
Given those rules I've come up with the following expression by myself:
^\D\w+$
Obviously, it doesn't rule out special characters in the first position; however, the following expression (which I didn't write myself) does:
^(?!\d)\w+$
Why does the second expression account for special characters? Shouldn't they produce the same results?

I will explain why the second regex works.
The second regex uses a negative lookahead. After matching the start of the string, the engine checks whether the next character is a digit, but it does not consume that character. This is important: if the next character is not a digit, the lookahead succeeds and \w must then match that same character, which it cannot do if the character is a symbol. If the character is a digit, the negative lookahead fails and nothing is matched.
\D, on the other hand, consumes the first character as long as it is not a digit, and \w+ only constrains what comes after it. That means any symbol is accepted in the first position.
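The difference described above can be checked with a short script. This is only a sketch using Python's re module; the sample strings are made up for illustration and are not from the question.

```python
import re

# Pattern from the question: one non-digit, then one or more word characters.
non_digit_first = re.compile(r"^\D\w+$")
# Pattern with a negative lookahead: word characters only, first one not a digit.
lookahead_first = re.compile(r"^(?!\d)\w+$")

# Hypothetical sample inputs.
samples = ["_name", "abc1", "1abc", "#ab01", "x"]

# Map each sample to (matches \D\w+, matches (?!\d)\w+).
results = {
    s: (bool(non_digit_first.match(s)), bool(lookahead_first.match(s)))
    for s in samples
}

for s, (plain, lookahead) in results.items():
    print(s, plain, lookahead)
```

Note that the two patterns also differ on one-character input: \D\w+ needs at least two characters, while (?!\d)\w+ accepts a single letter or underscore.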

^(?!\d)\w+$ matches a string consisting of word characters [a-zA-Z0-9_] that doesn't start with a digit.
^\D\w+$ matches a non-digit character followed by at least one character from the [a-zA-Z0-9_] set.
So #ab01 is matched by ^\D\w+$, while ^(?!\d)\w+$ rejects it.

(?!\d)\w+ means "match a run of word characters that does not start with a digit". Wrapped in ^ and $, it requires the whole string to consist of word characters, with a first character that is not a digit, which is obviously not the same as ^\D\w+$. ^(?!\d).\w+$ (note the "." after the lookahead) would behave the same as ^\D\w+$, apart from the fact that "." does not match a newline by default.

Related

What is the following token pattern matching: [A-Za-z0-9_]+(?=\\s+)

I understand that [A-Za-z0-9_]+ corresponds to a repeated sequence of one or more characters containing upper case letters, lower case letters, digits and underscores, but what does the whole expression correspond to?
I'm going to assume that your regex is /[A-Za-z0-9_]+(?=\s+)/ and that your programming language requires you to escape the \ as \\.
Like you said, [A-Za-z0-9_]+ matches one or more alphanumeric or underscore characters.
The (?=) pattern indicates a positive lookahead expression. We are checking whether the alphanumeric characters are followed by one or more (+) whitespace (\s) characters. However, the difference between /[A-Za-z0-9_]+\s+/ and /[A-Za-z0-9_]+(?=\s+)/ is that the former includes the whitespace in the match while the latter does not.
If you run your regex on this_is_followed_by_whitespace␠␠␠ where "␠" indicates spaces, only this_is_followed_by_whitespace will be matched. The expression is just looking ahead to check whether there is whitespace. Running /[A-Za-z0-9_]+\s+/ on the same string would match this_is_followed_by_whitespace␠␠␠.
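The same comparison can be reproduced in code. The following is a minimal sketch using Python's re module (the question does not say which language is in use), with three ordinary spaces standing in for the "␠" markers.

```python
import re

# Sample input from the answer, with three trailing spaces.
text = "this_is_followed_by_whitespace   "

# The lookahead only checks for whitespace; it is not part of the match.
with_lookahead = re.search(r"[A-Za-z0-9_]+(?=\s+)", text)
# Without the lookahead, the whitespace is consumed and included.
consuming = re.search(r"[A-Za-z0-9_]+\s+", text)

lookahead_match = with_lookahead.group(0)
consuming_match = consuming.group(0)

# With no whitespace after the word, the lookahead version finds nothing.
no_space = re.search(r"[A-Za-z0-9_]+(?=\s+)", "no_trailing_space")

print(repr(lookahead_match))
print(repr(consuming_match))
print(no_space)
```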
Play around with your regex on this RegExr demo.

What is the meaning of this regular expression? ['`?!\"-/] [duplicate]

What is the meaning of this regular expression?
['`?!\"-/]
Why does it match parentheses?
I am using Java for development.
In your regex
['`?!\"-/]
The sequence "-/ is being interpreted as a range of values, just as A-Z would mean taking every letter between A and Z. As the basic ASCII table shows, the parentheses lie within this range, so your pattern includes them.
One trick you can use here is to place the dash at the end:
['`?!\"/-]
^^^^ this will not be interpreted as a range
This happens because you didn't escape the dash (-). Inside a character class [], a dash denotes a range of characters, in this case from " to /, and in ASCII the parentheses fall within that range.
When you want a dash inside a character class to be matched literally, escape it as \- unless it is the first or last character.
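To see which characters each variant of the class accepts, one can enumerate the printable ASCII range. The sketch below uses Python rather than Java purely for brevity; character-class ranges work the same way here, and the helper name matched_by is made up for this example.

```python
import re

# Original class: "-/ is read as a range from '"' (0x22) to '/' (0x2F).
range_class = re.compile(r"""['`?!"-/]""")
# Dash moved to the end: it is now a literal dash.
literal_class = re.compile(r"""['`?!"/-]""")
# Dash escaped in place: also a literal dash.
escaped_class = re.compile(r"""['`?!"\-/]""")


def matched_by(pattern):
    """Return the set of printable ASCII characters the class accepts."""
    return {c for c in map(chr, range(32, 127)) if pattern.fullmatch(c)}


range_chars = matched_by(range_class)
literal_chars = matched_by(literal_class)
escaped_chars = matched_by(escaped_class)

print("".join(sorted(range_chars)))
print("".join(sorted(literal_chars)))
```

The first set contains every character from " through / in the ASCII table, including both parentheses; the other two contain only the seven characters that were intended.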
You need to escape the -; otherwise the class also matches parentheses.
"-/ is treated as a range that includes the parentheses, just as [A-C] matches the ASCII characters from A to C.
Use the following instead:
[\'`?!\"\-/]
It will match the following characters in a string:
'`?!"-/
Check it on regex101.

Perl code understanding

I am new to Perl and have been trying to understand the code below:
if ( $nextvalue !~ /^.+"[^ ]+ \/cs\/.+\sHTTP\/[1-9]\.[0-9]"|\/\/|\/Images\/fold\/1.jpg|\/busines|\/Type= OPTIONS|\/203.176.111.126/)
Can you please help me understand what the above is meant to do?
The condition is true when $nextvalue does NOT match the following regular expression.
The regular expression matches if the string
either
starts with at least one character,
followed by a double quote ("),
followed by at least one character that is not a space,
followed by a space,
followed by the string "/cs/",
followed by at least one character,
followed by a whitespace character and the string "HTTP/",
followed by one digit from 1 to 9 inclusive,
followed by a dot,
followed by one digit from 0 to 9,
followed by a double quote (")
or contains two forward slashes (//)
or contains the substring "/Images/fold/1.jpg" (the dot is unescaped, so it matches any character)
or contains the substring "/busines"
or contains the substring "/Type= OPTIONS"
or contains the substring "/203.176.111.126" (again, each unescaped dot matches any character)
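The breakdown above can be tried out on a few inputs. This is a rough Python translation of the Perl condition, not the original program: the function name passes_check and the sample log lines are invented for illustration, and the forward slashes are left unescaped because Python does not use / as a regex delimiter.

```python
import re

# Same alternation as in the Perl code; ^ applies only to the first branch.
PATTERN = re.compile(
    r'^.+"[^ ]+ /cs/.+\sHTTP/[1-9]\.[0-9]"'
    r"|//"
    r"|/Images/fold/1.jpg"
    r"|/busines"
    r"|/Type= OPTIONS"
    r"|/203.176.111.126"
)


def passes_check(line):
    """Mimic Perl's `$nextvalue !~ /.../`: true when nothing matches."""
    return PATTERN.search(line) is None


# Hypothetical log lines.
cs_request = '1.2.3.4 - - "GET /cs/page.html HTTP/1.1" 200'
other_request = '1.2.3.4 - - "GET /index.html HTTP/1.1" 200'
double_slash = "GET http://example.com/"
loose_dot = "/Images/fold/1xjpg"  # the unescaped dot matches "x"

for line in (cs_request, other_request, double_slash, loose_dot):
    print(passes_check(line), line)
```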
Whenever I am unsure what some cryptic regular expression does, I turn to Debuggex:
^.+"[^ ]+ \/cs\/.+\sHTTP\/[1-9]\.[0-9]"|\/\/|\/Images\/fold\/1.jpg|\/busines|\/Type= OPTIONS|\/203.176.111.126
Debuggex Demo
This is a railroad diagram. Every string that has a substring fitting the description along any of the grey tracks will match your regex. As your condition uses !~, meaning "does not match", those strings will then fail the check.
Debuggex certainly has issues (for example, it displays ^ as is, so you have to know that it means the beginning of the string; the same goes for dots and other symbols; whitespace shows up as underscores, etc.), but it certainly helps in understanding the structure of the expression and possibly gives you an idea of what the author had in mind.

What is the regular expression to allow uppercase/lowercase (alphabetical characters), periods, spaces and dashes only?

I am having problems creating a regex validator that checks to make sure the input has uppercase or lowercase alphabetical characters, spaces, periods, underscores, and dashes only. Couldn't find this example online via searches. For example:
These are ok:
Dr. Marshall
sam smith
.george con-stanza .great
peter.
josh_stinson
smith _.gorne
Anything containing other characters is not okay. That is, numbers or any other symbols.
The regex you're looking for is ^[A-Za-z.\s_-]+$
^ asserts that the regular expression must match at the beginning of the subject
[] is a character class - any character that matches inside this expression is allowed
A-Z allows a range of uppercase characters
a-z allows a range of lowercase characters
. matches a literal period; inside a character class it is not a wildcard
\s matches whitespace (spaces, tabs and line breaks)
_ matches an underscore
- matches a dash (hyphen); we have it as the last character in the character class so it doesn't get interpreted as being part of a character range. We could also escape it (\-) instead and put it anywhere in the character class, but that's less clear
+ asserts that the preceding expression (in our case, the character class) must match one or more times
$ Finally, this asserts that we're now at the end of the subject
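As a quick check of the pattern explained above, here is a small sketch in Python (the question does not name a language, so this choice is an assumption). The accepted samples come from the question; the rejected ones are made up.

```python
import re

# Letters, period, whitespace, underscore and a trailing literal dash.
NAME_PATTERN = re.compile(r"^[A-Za-z.\s_-]+$")


def is_valid(text):
    """Return True when text contains only the allowed characters."""
    return NAME_PATTERN.match(text) is not None


# Samples listed as OK in the question.
accepted = [
    "Dr. Marshall",
    "sam smith",
    ".george con-stanza .great",
    "peter.",
    "josh_stinson",
    "smith _.gorne",
]
# Hypothetical inputs containing a digit, a symbol, or nothing at all.
rejected = ["sam smith2", "a&b", ""]

accepted_results = [is_valid(s) for s in accepted]
rejected_results = [is_valid(s) for s in rejected]

print(accepted_results)
print(rejected_results)
```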
When you're testing regular expressions, you'll likely find a tool like regexpal helpful. This allows you to see your regular expression match (or fail to match) your sample data in real time as you write it.
Check out the basics of regular expressions in a tutorial. All it requires is two anchors and a repeated character class:
^[a-zA-Z ._-]*$
If you use the case-insensitive modifier, you can shorten this to
^[a-z ._-]*$
Note that the space is significant (it is just a character like any other).

Regex for alphanumeric, but at least one letter

In my ASP.NET page, I have an input box that has to have the following validation on it:
Must be alphanumeric, with at least one letter (i.e. can't be ALL numbers).
^\d*[a-zA-Z][a-zA-Z0-9]*$
Basically this means:
Zero or more ASCII digits;
One alphabetic ASCII character;
Zero or more alphanumeric ASCII characters.
Try a few tests and you'll see that this passes any alphanumeric ASCII string that contains at least one letter.
The key to this is the \d* at the front. Without it the regex gets much more awkward to do.
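A few such tests might look like the sketch below. It uses Python instead of ASP.NET for brevity, and passes re.ASCII so that \d means only 0-9, in line with the "ASCII digits" wording; the sample inputs are invented.

```python
import re

# Digits, then one letter, then any mix of letters and digits.
AT_LEAST_ONE_LETTER = re.compile(r"^\d*[a-zA-Z][a-zA-Z0-9]*$", re.ASCII)

# Hypothetical inputs mapped to whether the pattern accepts them.
checks = {
    s: AT_LEAST_ONE_LETTER.match(s) is not None
    for s in ["abc", "123abc", "12a34", "123", "", "ab_c"]
}

for text, ok in checks.items():
    print(repr(text), ok)
```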
Most answers to this question are correct, but there's an alternative, that (in some cases) offers more flexibility if you want to change the rules later on:
^(?=.*[a-zA-Z].*)([a-zA-Z0-9]+)$
This will match any sequence of alphanumeric characters, but only if the lookahead (?=.*[a-zA-Z].*) also succeeds, that is, only if the string contains at least one letter. It's a little-known trick in regular expressions that allows you to handle some very difficult validation problems.
For example, say you need to add another constraint: the string should be between 6 and 12 characters long. The obvious solutions posted here wouldn't work, but using the look-ahead trick, the regex simply becomes:
^(?=.*[a-zA-Z].*)([a-zA-Z0-9]{6,12})$
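The lookahead variant with the length constraint can be exercised the same way. This is an illustrative Python sketch; the sample strings are not from the answer.

```python
import re

# The lookahead demands a letter somewhere; the group enforces
# "6 to 12 alphanumeric characters" over the whole string.
LETTER_AND_LENGTH = re.compile(r"^(?=.*[a-zA-Z].*)([a-zA-Z0-9]{6,12})$")

# Hypothetical inputs mapped to whether the pattern accepts them.
length_checks = {
    s: LETTER_AND_LENGTH.match(s) is not None
    for s in ["abc123", "1234567a", "123456", "abc12", "abcdefghijklm"]
}

for text, ok in length_checks.items():
    print(repr(text), ok)
```

Because the lookahead consumes nothing, the two requirements are checked independently against the same string, which is what makes further constraints easy to add.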
^[\p{L}\p{N}]*\p{L}[\p{L}\p{N}]*$
Explanation:
[\p{L}\p{N}]* matches zero or more Unicode letters or numbers
\p{L} matches one letter
[\p{L}\p{N}]* matches zero or more Unicode letters or numbers
^ and $ anchor the string, ensuring the regex matches the entire string. You may be able to omit these, depending on which regex matching function you call.
Result: you can have any alphanumeric string except there's got to be a letter in there somewhere.
\p{L} is similar to [A-Za-z] except it will include all letters from all alphabets, with or without accents and diacritical marks. It is much more inclusive, using a larger set of Unicode characters. If you don't want that flexibility substitute [A-Za-z]. A similar remark applies to \p{N} which could be replaced by [0-9] if you want to keep it simple. See the MSDN page on character classes for more information.
The less fancy non-Unicode version would be
^[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*$
^[0-9]*[A-Za-z][0-9A-Za-z]*$
is the regex that will do what you're after. The ^ and $ match the start and end of the string to prevent other characters. You could replace the [0-9A-Za-z] block with \w, but I prefer the more verbose form because it's easier to extend with other characters if you want. Note that \w also accepts the underscore.
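One caveat when swapping in \w is that it is not an exact synonym for the explicit class. The sketch below illustrates this in Python, where \w on a str pattern also accepts the underscore and non-ASCII letters; .NET's \w is similarly broad, though the exact set may differ.

```python
import re

# Explicit ASCII class, as in the regex above.
explicit = re.compile(r"^[0-9]*[A-Za-z][0-9A-Za-z]*$")
# Same shape with \w in the tail.
with_word_class = re.compile(r"^[0-9]*[A-Za-z]\w*$")

# Hypothetical inputs: plain alphanumerics, an underscore, an accented letter.
samples = ["abc123", "a_b", "caf\u00e9"]

comparison = {
    s: (explicit.match(s) is not None, with_word_class.match(s) is not None)
    for s in samples
}

for text, (strict, loose) in comparison.items():
    print(repr(text), strict, loose)
```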
Add a regular expression validator to your asp.net page as per the tutorial on MSDN: http://msdn.microsoft.com/en-us/library/ms998267.aspx.
^\w*[\p{L}]\w*$
This one's not that hard. The regular expression reads: match a line starting with any number of word characters (letters, digits, and connector punctuation such as the underscore, which you might not want), that contains one letter character (that's the [\p{L}] part in the middle), followed by any number of word characters again.
If you want to exclude punctuation, you'll need a heftier expression:
^[\p{L}\p{N}]*[\p{L}][\p{L}\p{N}]*$
And if you don't care about Unicode you can use a boring expression:
^[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*$
^[0-9]*[a-zA-Z][a-zA-Z0-9]*$
This can be:
a number followed by a single letter,
or an alphanumeric expression starting with a letter,
or an alphanumeric expression starting with a number, followed by a letter and ending with an alphanumeric subexpression