Regex matching exact pattern (full word with or without apostrophe) and not partial components of a word [duplicate] - regex

This question already has answers here:
Matching Unicode letter characters in PCRE/PHP
(5 answers)
Closed 2 years ago.
I am trying to write a regex to match full words with or without an apostrophe.
I did this:
\b[a-zA-Z']+\b
However, it matches parts of the word Jönas, while the desired behaviour is to not match Jönas at all because of the ö in it.
Only words made up entirely of characters in a-zA-Z' should match.
Thus the following cases should match in full:
Jonas
Don't
hasn'T
But not for:
Jönas
Dön't
Hélló
demo here: https://regex101.com/r/2sVN5S/1/ (where Jönas and Hélton should not be matched at all, not even partially)
How can I fix the regex so that it only produces these exact matches?

UPDATE. Anubhava and Wiktor Stribiżew pointed out that using \b[a-zA-Z']+\b in Unicode mode is enough (fiddle 1 and fiddle 2).
As Wiktor said, there is no use case where the answer below is relevant (no engine supports lookaround groups without also supporting Unicode mode), so this answer is no longer relevant.
You can use this regex:
\b(?<![\x80-\xFF])[a-zA-Z']+(?![\x80-\xFF])\b
Here, [\x80-\xFF] stands for the range of character codes above the 7-bit ASCII set (where non-English letters lie). Basically, it looks for:
a sequence of English letters with or without apostrophes ...
not preceded by non-English letters (negative lookbehind group (?<!...))
not followed by non-English letters (negative lookahead group (?!...))
Working Regex101.com sample.
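A minimal Python sketch of the behaviour described above. It assumes Python's re module, where str patterns are Unicode-aware by default and re.ASCII switches \b and \w back to ASCII-only. The variable names are illustrative only, and in a Python str pattern \x80-\xFF covers just the Latin-1 range, so the lookaround variant is a demonstration rather than a general solution.

```python
import re

WORD = r"\b[a-zA-Z']+\b"
# The lookaround variant from the answer (Latin-1 range only in a str pattern).
GUARDED = r"\b(?<![\x80-\xFF])[a-zA-Z']+(?![\x80-\xFF])\b"

text = "Jonas Don't hasn'T Jönas Hélló"

# Unicode mode (the default for str): ö and é count as word characters,
# so there is no \b inside "Jönas" and no fragment of it can match.
unicode_matches = re.findall(WORD, text)

# ASCII mode: ö is a non-word character, so \b appears on both sides of it
# and the plain pattern picks up fragments such as "J" and "nas".
ascii_matches = re.findall(WORD, text, flags=re.ASCII)

# ASCII mode plus the lookarounds: fragments next to ö or é are rejected.
guarded_matches = re.findall(GUARDED, text, flags=re.ASCII)

print(unicode_matches)
print(ascii_matches)
print(guarded_matches)
```

Either approach keeps Jonas, Don't and hasn'T while dropping the accented words in this sample; other inputs (for example an apostrophe directly after an accented prefix) may still need extra handling.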

Related

Regex - Why don't these two expressions produce the same result? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 4 years ago.
I'm currently using this website to create some regular expressions for a programming language I want to build, at the moment I'm just setting up an expression for identifiers.
In my language, identifiers are expressed like most languages:
They cannot begin with a digit, or special character other than an underscore
After the first character they can contain alphanumeric and underscore characters
Given those rules I've come up with the following expression by myself:
^\D\w+$
Obviously, it doesn't account for special characters, however the following expression does (which I didn't make myself):
^(?!\d)\w+$
Why does the second expression account for special characters? Shouldn't they produce the same results?
I will explain why the second regex works.
The second regex uses a negative lookahead. After matching the start of the string, the engine checks whether the next character is a digit, but it does not consume it. This is important: if the next character is not a digit, the engine then uses \w to match that same character, which fails if the character is a symbol. If it is a digit, the negative lookahead fails and nothing is matched.
\D on the other hand, will match the character if it is not a digit, and \w will match whatever comes after that. That means all symbols are accepted.
^(?!\d)\w+$ means a string consisting of word characters [a-zA-Z0-9_] that doesn't start with a digit.
^\D\w+$ means a non-digit character followed by at least one character from the [a-zA-Z0-9_] set.
So #ab01 is matched by ^\D\w+$, while ^(?!\d)\w+$ rejects it.
(?!\d)\w+ means "match a run of word characters that does not start with a digit". Wrapped in ^ and $, it requires the whole string to consist of word characters, the first of which is not a digit, which is obviously not the same as ^\D\w+$. ^(?!\d).\w+$ (note the "." after the lookahead) would behave the same as ^\D\w+$, except that "." does not match a newline by default.
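The difference between the two identifier patterns can be checked with a short Python sketch. The sample strings and the helper name accepted are made up for illustration; only the two patterns come from the question.

```python
import re

NON_DIGIT_FIRST = re.compile(r"^\D\w+$")      # first char: anything except a digit
LOOKAHEAD_FIRST = re.compile(r"^(?!\d)\w+$")  # first char: a word char that is not a digit

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches."""
    return [s for s in candidates if pattern.search(s)]

samples = ["ab01", "_tmp", "#ab01", "1abc", "a"]

print(accepted(NON_DIGIT_FIRST, samples))  # ['ab01', '_tmp', '#ab01']
print(accepted(LOOKAHEAD_FIRST, samples))  # ['ab01', '_tmp', 'a']
```

Note the second difference this exposes: because \D consumes one character and \w+ needs at least one more, ^\D\w+$ rejects single-character identifiers such as a, while the lookahead version accepts them.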

Matching a substring of n numbers, but not if there are any numbers after that [duplicate]

This question already has answers here:
Java RegEx that matches exactly 8 digits
(3 answers)
Closed 5 years ago.
Basically I'm looking for a regex that matches some simple phone numbers.
I want to match numbers in a longer string of text like 123 4567, 891-0111, or 21314151, something that is (hopefully) identified by (\d{3,4}[- ]\d{3,4}|\d{4,8}), but I don't want to match them if they're part of a longer number like 3919503570275.
If I require the next character to be a non-digit or the end of a line, then that next character is also included in the match, which I don't want.
Surround your regex with a lookahead and a lookbehind to reject \d on both sides:
(?<!\d)(\d{3,4}[- ]\d{3,4}|\d{4,8})(?!\d)
Demo.
Note that this would accept a string that looks like a phone number preceded or followed by letters.
Depending on what programming language you use, I suggest either using a negative lookahead or using groups to extract the number.
See https://www.regular-expressions.info/lookaround.html for information about lookaround patterns.
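As a rough check of the lookaround-wrapped pattern, here is a small Python sketch. The sample sentence and the helper name find_numbers are assumptions for illustration; the pattern itself is the one given in the answer above.

```python
import re

# Reject a digit immediately before or after the candidate number.
PHONE = re.compile(r"(?<!\d)(\d{3,4}[- ]\d{3,4}|\d{4,8})(?!\d)")

def find_numbers(text):
    """Return every phone-like substring that is not part of a longer digit run."""
    return [m.group(0) for m in PHONE.finditer(text)]

sample = "Call 123 4567, 891-0111, or 21314151, but not 3919503570275."
print(find_numbers(sample))         # ['123 4567', '891-0111', '21314151']

# Letters on either side are still accepted, as the answer points out.
print(find_numbers("id1234567x"))   # ['1234567']
```

Because the lookarounds are zero-width, the neighbouring characters are inspected but never become part of the match.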

I could not seem to understand (?=.*?[A-Z]) this expression [duplicate]

This question already has answers here:
Regex lookahead, lookbehind and atomic groups
(5 answers)
Closed 5 years ago.
I'm trying to learn more advanced regular expressions for a password validator I'm working on, because I think regular expressions would be the best approach. I am using Java as my programming language.
For my pattern, people suggested (?=.*?[A-Z]) as a way to say "at least one upper case letter in the string". I have tried searching for it, but nothing makes clear how the ?=.*? part ensures that at least one is there.
Here is the whole pattern: ^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!#$%^&*-]).{8,}$
From what I understand:
? means optional and occurs once
= means well i don't know yet
. is a wildcard
[A-Z] is the range of uppercase letters from A-Z
TLDR: So my question is: how does (?=.*?[A-Z]) make sure at least one uppercase letter is included? Any in-depth explanation?
(?= is the start of a look-ahead group — the question mark does not mean the same as a ? elsewhere
.*? is a non-greedy match against anything or nothing. The question-mark here also does not mean 'optional'.
[A-Z] is a character set containing the upper case ASCII letters A through to Z.
) is the end of the look-ahead group
So the net result is:
"Look ahead and see if, after maybe some characters, there is an upper case letter."
Your full expression, ^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!#$%^&*-]).{8,}$, can be read as:
"Match if the string contains an upper case letter, and a lower case letter, and a digit, and one of the listed special characters, and there are at least 8 characters in total."
The regex uses a feature named positive lookahead, which is one of the regex lookarounds:
Positive lookahead: (?=...). Ex: a(?=b) matches a if followed by b
Negative lookahead: (?!...). Ex: a(?!b) matches a if not followed by b
Positive lookbehind: (?<=...). Ex: (?<=a)b matches b if preceded by a
Negative lookbehind: (?<!...). Ex: (?<!a)b matches b if not preceded by a
For your whole regex, you can see easily your pattern with this diagram:
Diagram link
Regarding (?=.*?[A-Z]): it is used right after the ^, so ^(?=.*?[A-Z]) checks, from the start of the line, that an uppercase character appears somewhere later in it. The lookahead itself consumes no characters.
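The question is about Java, but the lookahead syntax is the same in Python's re module, so the following Python sketch is used here only to show the mechanics. The sample passwords and constant names are invented for illustration; the pattern is copied from the question as posted.

```python
import re

# A lookahead on its own: succeeds at position 0 without consuming anything.
HAS_UPPER = re.compile(r"^(?=.*?[A-Z])")

# The full pattern from the question: four independent checks, then the length rule.
PASSWORD = re.compile(
    r"^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!#$%^&*-]).{8,}$"
)

m = HAS_UPPER.search("abcDef")
print(m.span())                       # (0, 0): matched, but zero characters consumed
print(HAS_UPPER.search("abcdef"))     # None: no uppercase letter anywhere ahead

for candidate in ["Passw0rd!", "password1!", "Pw0rd!x", "Password1"]:
    print(candidate, bool(PASSWORD.search(candidate)))
```

Each lookahead starts from the same position (the start of the string), which is why the four requirements can be satisfied in any order before .{8,} finally consumes the characters.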

Regex, two uppercase characters in a string [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
I'd like to use Regex to ensure that there are at least two upper case characters in a string, in any position, together or not.
The following gives me two together:
([A-Z]){2}
Environment - Classic ASP VB.
You can use the simple regex
[A-Z].*[A-Z]
which matches an upper case letter, followed by any number of anything (except linefeed), and another uppercase letter.
If you need it to allow linefeeds between the letters, you'll have to set the single-line flag. If you're using JavaScript (you should always include the flavor/language tag when asking regex-related questions), older engines don't have that option (the s flag was only added in ES2018). In that case the solution suggested by Wiktor S in a comment to another answer should work.
[A-Z].*[A-Z]
A to Z , any symbols between, again A to Z
update
As Wiktor mentioned in comments:
This regex will check for 2 letters on a line (in most regex flavors), not in a string.
So
[A-Z][^A-Z]*[A-Z]
Should do the thing (In most regex flavors/tools)
I believe what you're looking for is something like this:
.*([A-Z]).*([A-Z]).*
Broken out into segments, that's:
.* //Any number of characters (including zero)
([A-Z]) //A capital letter
.* //Any number of characters (including zero)
([A-Z]) //A second capital letter
.* //Any number of characters (including zero)
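The line-versus-string distinction raised in the answers can be sketched in Python (the question targets Classic ASP, so this is only an illustration of the patterns, not of that environment). The helper name has_two_uppercase is hypothetical.

```python
import re

SAME_LINE = re.compile(r"[A-Z].*[A-Z]")             # "." stops at a line break
ANY_LINE = re.compile(r"[A-Z][^A-Z]*[A-Z]")         # negated class crosses line breaks
DOTALL = re.compile(r"[A-Z].*[A-Z]", re.DOTALL)     # single-line flag: "." matches "\n" too

def has_two_uppercase(pattern, text):
    """True if the pattern finds two uppercase letters somewhere in text."""
    return pattern.search(text) is not None

for text in ["aXbYc", "aXbc", "X\nY"]:
    print(repr(text),
          has_two_uppercase(SAME_LINE, text),
          has_two_uppercase(ANY_LINE, text),
          has_two_uppercase(DOTALL, text))
```

Only the third sample separates the variants: with the two capitals on different lines, the plain .* version fails while the negated-class and single-line versions succeed.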

Regex IP wildcard Conversion [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 4 years ago.
I need to convert the following to Regex. I need the second, and third octets to be a wildcard. I am awful with Regex as I rarely ever need to use it. I have searched the web for a solution. Can someone help with this?
10.(Wildcard).(Wildcard).248
Thanks!
Short and sweet:
10\.\d{1,3}\.\d{1,3}\.248 will get you pretty close, and is relatively simple.
Escape the dot with \. to prevent it from matching any character
Use \d to match any digit character
Use {1,3} to limit the number of consecutive digits to 1, 2, or 3
More complicated, but more exact:
To only match numbers between 0 and 255, you could replace \d{1,3} with ([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]):
10\.([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])\.([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])\.248
- or -
10(\.([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])){2}\.248
Testing/developing regex patterns in the future
There are a lot of regex tester websites out there. I personally use RegexHero.net since I develop .NET applications, but there are other more generic options too, such as regexpal.com.
10\.\d+\.\d+\.248
'\.' matches '.', and '\d' matches any digit. '\d+' matches 1 or more digits.
Other characters match themselves.
You'll need to escape the dot metacharacter with a backslash, but something like this will work:
10\.[0-9]{1,3}\.[0-9]{1,3}\.248
Note: this pattern isn't bulletproof. It doesn't check if the IP address is valid, it only checks if it matches an IP-like pattern. For example, it will match 10.999.999.248.
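To compare the simple and the range-checked patterns from the answers above, here is a short Python sketch. It uses re.fullmatch so the whole string must be the address; the constant names are illustrative, and the octet sub-pattern is the one given earlier with a non-capturing group.

```python
import re

# One octet restricted to 0-255, as in the "more exact" answer.
OCTET = r"(?:[01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])"

LOOSE = re.compile(r"10\.\d{1,3}\.\d{1,3}\.248")
STRICT = re.compile(rf"10\.{OCTET}\.{OCTET}\.248")

for address in ["10.1.2.248", "10.255.0.248", "10.999.999.248", "10x1x2x248"]:
    print(address,
          bool(LOOSE.fullmatch(address)),
          bool(STRICT.fullmatch(address)))
```

The loose pattern accepts 10.999.999.248, which is exactly the case the note above warns about, while the strict pattern rejects it. Both reject 10x1x2x248 because the dots are escaped.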