I'm doing a file search, and I want to exclude files that contain min in them, but I don't want to match parts of words.
so I want to exclude:
a directory named min
also a file named jhkjahdf-min.txt
But I don't want to exclude:
a file named mint.txt
Thank you for your help in advance. An explanation of the regular expression you give would be a lot of extra help.
I'm using PHP's preg_match() function
There are things called word boundaries (\b). They do not consume a character, but match wherever you go from a word character (letters, digits, underscores) to a non-word character or vice versa. So this is a good first approximation:
`\bmin\b`
Now if you want to consider digits and underscores as word boundaries, too, it gets a bit more complicated. Then you need negative lookaheads and lookbehinds, which are not supported by all regex engines:
`(?<![a-zA-Z])min(?![a-zA-Z])`
The lookarounds are not included in the match either (if you care about that at all); they just assert that min is neither preceded nor followed by a letter.
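To see where the two patterns disagree, here is a minimal sketch. It uses Python's re module rather than PHP's preg_match(), on the assumption that \b and lookarounds behave the same way in both for these patterns. The name min_2.txt is made up for illustration; it is not from the question.

```python
import re

# Pattern 1: plain word boundaries (digits and underscores count as word characters).
word_boundary = re.compile(r"\bmin\b")
# Pattern 2: only letters count as part of a word.
letters_only = re.compile(r"(?<![a-zA-Z])min(?![a-zA-Z])")

# min_2.txt is a hypothetical name added to show where the patterns differ.
names = ["min", "jhkjahdf-min.txt", "mint.txt", "min_2.txt"]

# For each name: (matched by \bmin\b, matched by the letters-only pattern)
results = {
    name: (bool(word_boundary.search(name)), bool(letters_only.search(name)))
    for name in names
}

for name, (by_boundary, by_letters) in results.items():
    print(f"{name}: word-boundary={by_boundary}, letters-only={by_letters}")
```

Both patterns flag min and jhkjahdf-min.txt and leave mint.txt alone. Only the lookaround version flags min_2.txt, because the underscore is a word character and so there is no \b after the n.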
You'll want to anchor your regex with \b which means "word boundary."
I am trying to create a regular expression that will identify possible abbreviations within a given string in Python. I am kind of new to regex and I am having difficulty creating an expression, though I believe it should be fairly simple. The expression should pick up words that have two or more capital letters. It should also pick up words where a dash has been used in between and report the whole word (both before and after the dash). If numbers are present, they should be reported with the word as well.
As such, it should pick up:
ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC.
I have already made the following expression: r'\b(?:[a-z]*[A-Z\-][a-z\d[^\]*]*){2,}'.
However this does also pick up these wrong words:
A-bc, a-b-c
I believe the problem is that it looks for either multiple capitalised letters or dashes. I want it to give me only words that have at least two capitalised letters. I understand that it will also "mistakenly" take words such as "Abc-Abc", but I don't believe there is a way to avoid these.
If lookaheads are supported and you don't want to match a double hyphen (--), you might use:
\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b
Explanation
\b A word boundary
(?= Positive lookahead, assert that from the current location to the right is
(?:[a-z\d-]*[A-Z]){2} Match 2 times: optionally any of the allowed characters, followed by an uppercase char A-Z
) Close the lookahead
[A-Za-z\d]+ match 1+ times the allowed characters without the hyphen
(?:-[A-Za-z\d]+)* Optionally repeat - and 1+ times the allowed characters
\b A word boundary
See a regex101 demo.
To also avoid matching when hyphens surround the characters, you can use negative lookarounds asserting that there is no hyphen to the left or right.
\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)
See another regex demo.
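Since the question is about Python, here is a small sketch that runs both patterns over the words from the question. The variable names and the "x -ABC- y" test string are made up for illustration.

```python
import re

# First pattern: at least two uppercase letters, hyphen-joined parts allowed.
basic = re.compile(r"\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b")
# Second pattern: additionally refuses a hyphen directly before or after the word.
strict = re.compile(
    r"\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)"
)

wanted = "ABC AbC ABc A-ABC a-ABC ABC-a ABC123 ABC-123 123-ABC"
unwanted = "A-bc a-b-c"

found = basic.findall(wanted + " " + unwanted)
print(found)  # only the words from `wanted`

# Hypothetical input with hyphens around the word.
dashed = "x -ABC- y"
print(basic.findall(dashed))   # ['ABC']
print(strict.findall(dashed))  # []
```

The lookahead cannot run past the end of a word because the space is not in [a-z\d-], which is why A-bc and a-b-c are rejected.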
I am trying to extract words that have at least one character from a special character set. It picks up some words and not others. Here is a link to regex101 to test it. This is the regex \b(\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*)\b, and this is the sample sentence I am using
His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn Mūsa ibn
Al-Daḥāk Al-Sulamī Al-Tirmidhī.
It should match the following words:
ʿĪsa Muḥammad ʿĪsa Mūsa Al-Daḥāk Al-Sulamī Al-Tirmidhī
I am not too experienced with regex, so I have no idea what I am doing wrong. If someone knows any tool to find out why a specific word doesn't match a regex pattern, please let me know as well.
You can use
[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ][\wāīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]*
After matching the one required special character, use another character set to match more occurrences of those characters or normal word characters.
https://regex101.com/r/ovJoLt/2
You can make this work by enabling the Unicode flag /u (so that the word boundary \b assertions support Unicode characters) and adding hyphens to the surrounding character groups:
/\b[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+[\w-]*\b/gu
Plus, you don't need the capturing group, since the only characters being matched form the desired output anyway (\b is a zero-width assertion).
Demo
You are not doing anything wrong, except that to match Unicode word boundaries you have to enable the u modifier or use (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*(?!\S)
If you want to match hyphens, add - to your character class: (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]+\w*(?!\S)
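As a cross-check outside regex101, here is a rough sketch in Python, where \w and \b are already Unicode-aware for str patterns, so no u flag is needed. It uses the \b[\w-]*[...]+[\w-]*\b pattern from above. The NFC normalization step is my own precaution, not part of any answer: it assumes the input might contain decomposed accents, which the character class would not match.

```python
import re
import unicodedata

# NFC normalization (an added precaution) makes sure each accented letter
# is a single code point, which is what the character class expects.
special = unicodedata.normalize("NFC", "āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ")
text = unicodedata.normalize(
    "NFC",
    "His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn Mūsa ibn "
    "Al-Daḥāk Al-Sulamī Al-Tirmidhī.",
)

# Optional word characters or hyphens, at least one special character,
# then more word characters or hyphens.
pattern = re.compile(rf"\b[\w-]*[{special}]+[\w-]*\b")

matches = pattern.findall(text)
print(matches)
```

This should print the seven words listed in the question, with the hyphenated names kept whole and the trailing period left out.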
I have the following regex:
^(?!FIT|FAT|FUTURE)[F-I].*
[F-I] shows that any words starting with F to I should match apart from the given list of words that shouldn't match.
Currently, it doesn't match a word like FITTER, but I only want the regex to not match if it's a whole word such as FIT, FAT and FUTURE.
These are following scenarios I want it to work for:
Matches
FUTURE-YES
FITTER
GOOD
ITCHY
Non Matches
FIT
FAT
FUTURE
Brief
Usually, you would use word boundaries \b to ensure the edge of a word. In your case, however, you have some words that use hyphens -, thus, this is likely the solution you're looking for.
Code
See regex in use here
^(?!(?:FIT|FAT|FUTURE)(?![\w-]))[F-I][\w-]*$
Results
Input
FUTURE-YES
FITTER
GOOD
ITCHY
FIT
FAT
FUTURE
Output
Note: Shown below are matches
FUTURE-YES
FITTER
GOOD
ITCHY
Explanation
^ Assert position at the start of the line
(?!(?:FIT|FAT|FUTURE)(?![\w-])) Negative lookahead ensuring what follows does not match
(?:FIT|FAT|FUTURE) Match either FIT, FAT or FUTURE literally
(?![\w-]) Negative lookahead ensuring what follows does not match a word character or hyphen -
[F-I] Match a character between F and I (FGHI)
[\w-]* Match any word character or hyphen - character any number of times
$ Assert position at the end of the line
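The results above can be reproduced with a short script. This is a sketch in Python (the question does not name a language), testing each word on its own so that ^ and $ refer to the whole word.

```python
import re

# Reject FIT, FAT and FUTURE only when nothing word-like (or a hyphen) follows.
pattern = re.compile(r"^(?!(?:FIT|FAT|FUTURE)(?![\w-]))[F-I][\w-]*$")

words = ["FUTURE-YES", "FITTER", "GOOD", "ITCHY", "FIT", "FAT", "FUTURE"]

matched = [word for word in words if pattern.search(word)]
print(matched)  # ['FUTURE-YES', 'FITTER', 'GOOD', 'ITCHY']
```

FITTER gets through because the T after FIT makes the inner (?![\w-]) fail, so the outer negative lookahead no longer rejects the word.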
I'm trying to use a regex pattern (in Java) to find a sequence of exactly 3 digits in a row. 4 digits shouldn't match, and 2 digits shouldn't match.
The obvious pattern to me was:
"\b(\d{3})\b"
That matches against many source string cases, such as:
">123<"
" 123-"
"123"
But it won't match against a source string of "abc123def" because the c/1 boundary and the 3/d boundary don't count as a "word boundary" match that the \b class is expecting.
I would have expected the solution to be adding a character class that includes both non-Digit (\D) and the word boundary (\b). But that appears to be illegal syntax.
"[\b\D](\d{3})[\b\D]"
Does anybody know what I could use as an expression that would extract "123" for a source string situation like:
"abc123def"
I'd appreciate any help. And yes, I realize that in Java one must double-escape the codes, like \b to \\b, but that's not my issue and I didn't want to limit this to Java folks.
You should use lookarounds for those cases:
(?<!\d)(\d{3})(?!\d)
This means: match 3 digits that are neither preceded nor followed by a digit.
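Here is a quick sketch of the lookaround pattern next to the original \b version, written in Python rather than Java (the question says it is not meant to be Java-specific). The "1234" and "12" samples are added to cover the too-long and too-short cases.

```python
import re

three_digits = re.compile(r"(?<!\d)(\d{3})(?!\d)")  # lookaround version
word_bounded = re.compile(r"\b(\d{3})\b")           # original attempt

samples = [">123<", " 123-", "123", "abc123def", "1234", "12"]

lookaround_hits = {s: three_digits.findall(s) for s in samples}
boundary_hits = {s: word_bounded.findall(s) for s in samples}

for s in samples:
    print(repr(s), lookaround_hits[s], boundary_hits[s])
```

The two patterns agree on every sample except "abc123def", where only the lookaround version finds 123.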
Working Demo
Lookarounds can solve this problem, but I personally try to avoid them because not all regex engines fully support them. Additionally, I wouldn't say this issue is complicated enough to merit the use of lookarounds in the first place.
You could match this: (?:\b|\D)(\d{3})(?:\b|\D)
Then return: \1
Or if you're performing a replacement and need to match the entire string: (?:\b|\D)+(\d{3})(?:\b|\D)+
Then replace with: \1
As a side note, the reason \b wasn't working as part of a character class was because within brackets, [\b] actually has a completely different meaning--it refers to a backspace, not a word boundary.
Here's a Working Demo.
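One thing worth checking with the alternation approach is what happens when two 3-digit groups are separated by a single non-digit. The sketch below (Python, with made-up test strings) compares it with the lookaround pattern and also shows the [\b] backspace behaviour mentioned in the side note.

```python
import re

alternation = re.compile(r"(?:\b|\D)(\d{3})(?:\b|\D)")
lookaround = re.compile(r"(?<!\d)(\d{3})(?!\d)")

print(alternation.findall("abc123def"))  # ['123']

# The \D after 123 consumes the "a", so nothing is left to satisfy
# the leading (?:\b|\D) in front of 456.
print(alternation.findall("123a456"))    # ['123']
print(lookaround.findall("123a456"))     # ['123', '456']

# Inside a character class, \b means backspace (\x08), not a word boundary.
backspace_class = re.compile(r"[\b]")
print(bool(backspace_class.search("a\x08b")))  # True
print(bool(backspace_class.search("a b")))     # False
```

So if the input can contain groups packed that tightly, the zero-width lookarounds are the safer choice. Otherwise the alternation works fine.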
I am trying to parse a phrase and exclude common words.
For instance in the phrase "as the world turns", I want to exclude the common words "as" and "the" and return only "world" and "turns".
(\w+(?!the|as))
Doesn't work. Feedback appreciated.
The lookahead should come first:
(\b(?!(the|as)\b)\w+\b)
I have also added word boundaries to ensure that it only matches whole words. Otherwise it would fail to match the complete word "as", but it would still match the letter "s" of that word.
You might also want to consider what \w matches and if that meets your needs. If you are looking for words in English you probably are interested in letters but not digits and you may wish to include some punctuation characters that are excluded by \w, such as apostrophes. You could try something like this instead (Rubular):
/(\b(?!(?:the|as)\b)[a-z'-]+\b)/i
To match words more accurately in a human language you could consider using a natural language parsing library instead of regular expressions.
You should use word boundaries to only match whole words. Either with a look-ahead assertion:
(\b(?!(?:the|as)\b)\w+\b)
Or with a look-behind assertion:
(\b\w+\b(?<!\b(?:the|as)))
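Here is a small sketch in Python (the question does not name a language) comparing the original attempt with the look-ahead version. The second test phrase and the no_inner_boundary pattern are made up, to show why the \b inside the lookahead matters. The look-behind variant is left out because Python's re module only accepts fixed-width look-behinds, and the|as has alternatives of different lengths.

```python
import re

original = re.compile(r"(\w+(?!the|as))")                # lookahead comes too late
fixed = re.compile(r"\b(?!(?:the|as)\b)\w+\b")           # look-ahead version
no_inner_boundary = re.compile(r"\b(?!the|as)\w+\b")     # hypothetical: inner \b dropped

phrase = "as the world turns"
print(original.findall(phrase))  # ['as', 'the', 'world', 'turns']
print(fixed.findall(phrase))     # ['world', 'turns']

# Words that merely start with "as" or "the" should survive.
tricky = "as the ashes settle there"
print(fixed.findall(tricky))              # ['ashes', 'settle', 'there']
print(no_inner_boundary.findall(tricky))  # ['settle']
```

The original pattern matches every word because \w+ has already consumed the word by the time the lookahead runs, so the lookahead only ever inspects the space (or the end of the string) after it.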