Regex command to match combinations but not only uppercase letters - regex

Is there a regex command to match all combinations of uppercase letters, lowercase letters, underscores, brackets, and numbers, but not words made up only of uppercase letters or only of numbers?
I thought I had it with this one:
(/\b(?![A-Z]+\b)(?![0-9]+\b)[a-zA-Z0-9_{}]+\b/)
That was until I encountered: ABC{hello}_HI_HelLo
This is not a match, and I would like my regex to match this string.
There seems to be something wrong with the negative lookahead: it reads "ABC", assumes it is an uppercase-only word, and so does not match the whole string; only the part from the "{" onward is matched.
When you add an underscore after "ABC", you get a matching string: ABC_{hello}_HI_HelLo

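In ABC{hello}_HI_HelLo there is a word boundary between C and {, so [A-Z]+\b matches ABC and the negative lookahead (?![A-Z]+\b) rejects that starting position.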
You can assert a whitespace boundary to the left (?<!\S) and the right (?!\S) instead.
The pattern matches:
(?<!\S) Assert a whitespace boundary to the left
(?![A-Z]+(?!\S)) Assert not only uppercase chars followed by a whitespace boundary at the right
(?![0-9]+(?!\S)) Assert not only digits followed by a whitespace boundary at the right
[a-zA-Z0-9_{}]+ Match 1 or more occurrences of any of the listed
(?!\S) Assert a whitespace boundary to the right
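To see the difference in practice, here is a minimal JavaScript sketch that runs the question's pattern and a whitespace-boundary version side by side. The full pattern is assembled from the parts described in this answer (with (?!\S) closing it on the right), the helper name firstMatch is made up for illustration, and lookbehind needs an ES2018+ engine.

```javascript
// Original attempt from the question: relies on \b word boundaries.
const withWordBoundaries = /\b(?![A-Z]+\b)(?![0-9]+\b)[a-zA-Z0-9_{}]+\b/;

// Assembled from the parts in the answer: whitespace boundaries on both sides.
const withWhitespaceBoundaries =
  /(?<!\S)(?![A-Z]+(?!\S))(?![0-9]+(?!\S))[a-zA-Z0-9_{}]+(?!\S)/;

// Hypothetical helper: return the first matched text, or null if nothing matches.
function firstMatch(pattern, text) {
  const m = text.match(pattern);
  return m ? m[0] : null;
}

const sample = "ABC{hello}_HI_HelLo";

// \b sits between "C" and "{", so "ABC" counts as an uppercase-only word
// and the match can only start at "{".
console.log(firstMatch(withWordBoundaries, sample)); // "{hello}_HI_HelLo"

// With whitespace boundaries the whole token is matched.
console.log(firstMatch(withWhitespaceBoundaries, sample)); // "ABC{hello}_HI_HelLo"

// Uppercase-only and digit-only tokens are still rejected.
console.log(firstMatch(withWhitespaceBoundaries, "ABC")); // null
console.log(firstMatch(withWhitespaceBoundaries, "12345")); // null
```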

Related

Regex: uppercase words that don't start with a hyphen

I need to match all uppercase words that don't start with a hyphen.
There are multiple uppercase words in each line.
examples:
,BOAT -> match
BANANA, -> match
WATER -> match
-ER -> no match because of hyphen
Thanks in advance :)
You may use this regex:
(?<!\S)[^-A-Z\s]*[A-Z]+
RegEx Explained:
(?<!\S): Make sure we don't have a non-space before current position
[^-A-Z\s]*: Match 0 or more of any characters that are not hyphen and not uppercase letters and not whitespaces
[A-Z]+: Match 1+ uppercase letters
You can use
\b(?<!-)[A-Z]+\b
\b(?<!-)\p{Lu}+\b
Details:
\b - word boundary
(?<!-) - a negative lookbehind that fails the match if there is a - immediately to the left of the current position
[A-Z]+ / \p{Lu}+ - one or more uppercase letters (\p{Lu} matches any uppercase Unicode letters)
\b - word boundary.
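As a rough check of both answers, the following JavaScript sketch runs the two ASCII patterns over one line built from the question's examples. The variable names are made up for illustration, and lookbehind needs an ES2018+ engine. Note that the first pattern keeps leading punctuation in the match, while the second returns only the letters.

```javascript
// Sample line built from the question's examples.
const line = ",BOAT BANANA, WATER -ER";

// First answer: starts at a whitespace boundary, so the match may
// include leading punctuation such as ",".
const fromWhitespace = /(?<!\S)[^-A-Z\s]*[A-Z]+/g;

// Second answer: word boundaries plus a lookbehind that rejects a
// preceding "-".
const fromWordBoundary = /\b(?<!-)[A-Z]+\b/g;

console.log(line.match(fromWhitespace)); // [ ',BOAT', 'BANANA', 'WATER' ]
console.log(line.match(fromWordBoundary)); // [ 'BOAT', 'BANANA', 'WATER' ]
```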

Match everything until upcase word

I want to capture a word placed before another one which is fully capitalized
Mister Foo BAR is here # => "Foo"
Miss Bar-Barz FOO loves cats # => "Bar-Barz"
I've been trying the following regex: (Mister|Miss)\s([[:alpha:]\s\-]+)(?=\s[A-Z]+), but sometimes it includes the rest of the sentence. For example, it'll return Bar-Barz FOO loves cats instead of Bar-Barz.
How can I say, using RegExp, "match every word until the uppercase word"?
To clarify the usage of negative lookahead, can we say it "captures until the specified sub-pattern matches, but does not include it in the match data"?
As a non-native English speaker, apologies if my question isn't perfectly formulated. Thanks in advance.
Match 1+ word chars, optionally followed by repetitions of a - and 1+ word chars, so that a lone hyphen or a trailing hyphen is not matched.
Then assert that what follows is a whitespace char, 1+ uppercase chars, and a word boundary.
\w+(?:-\w+)*(?=\s[A-Z]+\b)
Explanation
\w+ Match 1+ word char
(?:-\w+)* Optionally repeat matching - and 1+ word chars
(?=\s[A-Z]+\b) Positive lookahead, assert that directly to the right is a whitespace char followed by 1+ uppercase chars A-Z and a word boundary
If there cannot be any newlines between the words, you can use [^\S\r\n]+ (1+ whitespace chars other than newlines) instead of \s
\w+(?:-\w+)*(?=[^\S\r\n]+[A-Z]+\b)
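A small JavaScript sketch of both variants against the question's sample sentences may help; the helper name wordBeforeUppercase and the variable names are invented for this example. The last two calls show where the variants differ: only the \s version looks across a newline.

```javascript
// Variant 1: any whitespace char (including a newline) before the uppercase word.
const anyWhitespace = /\w+(?:-\w+)*(?=\s[A-Z]+\b)/;

// Variant 2: spaces/tabs only, so the uppercase word must be on the same line.
const sameLineOnly = /\w+(?:-\w+)*(?=[^\S\r\n]+[A-Z]+\b)/;

// Hypothetical helper: return the first matched text, or null.
function wordBeforeUppercase(pattern, text) {
  const m = text.match(pattern);
  return m ? m[0] : null;
}

console.log(wordBeforeUppercase(anyWhitespace, "Mister Foo BAR is here")); // "Foo"
console.log(wordBeforeUppercase(anyWhitespace, "Miss Bar-Barz FOO loves cats")); // "Bar-Barz"
console.log(wordBeforeUppercase(sameLineOnly, "Miss Bar-Barz FOO loves cats")); // "Bar-Barz"

// Across a line break only the \s variant finds a match.
console.log(wordBeforeUppercase(anyWhitespace, "Foo\nBAR")); // "Foo"
console.log(wordBeforeUppercase(sameLineOnly, "Foo\nBAR")); // null
```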
You may use this regex with a lookahead:
\b\S+(?=[ \t]+[A-Z]+\b)
RegEx Description:
\b: Word boundary
\S+: Match 1+ non-whitespace characters
(?=[ \t]+[A-Z]+\b): Positive lookahead that asserts we have 1+ spaces or tabs and then a word containing only capital letters
You don't say what language you're working in, but the following works for me. The idea is to stop when the parser hits a sequence of uppercase letters/hyphens.
JS example:
let ptn = /(Mister|Miss)\s[\w\-]+(?=\s[A-Z\-]+)/;
"Mister Foo BAR is here".match(ptn); //["Mister Foo", "Mister"]
"Miss Bar-Barz FOO loves cats".match(ptn); //["Miss Bar-Barz", "Miss"]

Unmatch complete words if a negative lookahead is satisfied

I need to match only those words which don't contain special characters like # and :.
For example:
git#github.com shouldn't match
list should return a valid match
show should also return a valid match
I tried it using a negative lookahead: \w+(?![#:])
However, it matches gi out of git#github.com, which it shouldn't match either.
You may add \w to the lookahead:
\w+(?![\w#:])
The equivalent is using a word boundary:
\w+\b(?![#:])
Besides, you may consider adding a left-hand boundary to avoid matching words inside non-word non-whitespace chunks of text:
^\w+(?![\w#:])
Or
(?<!\S)\w+(?![\w#:])
The ^ will match the word only at the start of the string, and (?<!\S) will match only if the word is preceded by whitespace or is at the start of the string.
Why not (?<!\S)\w+(?!\S), the whitespace boundaries? Since you are building a lexer, you most probably have to deal with natural language sentences where words are likely to be followed by punctuation, and the (?!\S) negative lookahead would make \w+ match only when it is followed by whitespace or is at the end of the string.
You can use negative lookbehind and negative lookahead patterns around a word pattern to make sure that the word is not preceded or followed by a non-space character, or in other words, to make sure that it is surrounded by either a space or a string boundary:
(?<!\S)\w+(?!\S)
Demo: https://regex101.com/r/cjhUUM/2
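To compare the variants discussed in both answers, here is a minimal JavaScript sketch. The helper name allMatches and the input strings (other than git#github.com, list, and show from the question) are made up for illustration; lookbehind needs an ES2018+ engine.

```javascript
// Hypothetical helper: return every match as an array (empty if none).
function allMatches(pattern, text) {
  return text.match(pattern) || [];
}

const input = "git#github.com list show";

// Question's attempt: backtracks to "gi" because "t" is not # or :.
console.log(allMatches(/\w+(?![#:])/g, input));
// [ 'gi', 'github', 'com', 'list', 'show' ]

// Adding \w to the lookahead stops the backtracking, but words inside
// the git#github.com chunk are still found.
console.log(allMatches(/\w+(?![\w#:])/g, input));
// [ 'github', 'com', 'list', 'show' ]

// With a left-hand whitespace boundary only standalone words remain.
console.log(allMatches(/(?<!\S)\w+(?![\w#:])/g, input));
// [ 'list', 'show' ]

// Whitespace boundaries on both sides also reject words followed by punctuation.
console.log(allMatches(/(?<!\S)\w+(?![\w#:])/g, "list, show")); // [ 'list', 'show' ]
console.log(allMatches(/(?<!\S)\w+(?!\S)/g, "list, show")); // [ 'show' ]
```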

Remove non-alphabetic words from a sentence using regex

Is it possible to remove words in a sentence that don't contain a-z letters? I've thought about negative lookarounds but wasn't successful.
For example,
This is a 1-2-a3 sample 12 -- 7-8 sentence
becomes
This is a 1-2-a3 sample sentence
Assume all other punctuation was removed except dashes.
Thanks!
The regex below matches those words which contain no letters.
(?<!\S)[^a-zA-Z\s]+(?!\S)
Just replace the matched words with an empty string to get your desired output.
(?<!\S): Negative lookbehind which asserts that the match is not preceded by a non-space character.
(?!\S): Negative lookahead which asserts that the match is not followed by a non-space character.
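A short JavaScript sketch of the replacement on the question's sentence follows. Removing the matched words leaves the surrounding spaces behind, so the second step, which collapses runs of whitespace, is an extra assumption on my part and not part of the answer's regex; the variable names are also made up.

```javascript
// Words made only of non-letter, non-whitespace characters, delimited
// by whitespace or the string edges.
const nonAlphabeticWord = /(?<!\S)[^a-zA-Z\s]+(?!\S)/g;

const sentence = "This is a 1-2-a3 sample 12 -- 7-8 sentence";

// Step 1: drop the matched words ("12", "--", "7-8"); "1-2-a3" stays
// because it contains a letter.
const stripped = sentence.replace(nonAlphabeticWord, "");

// Step 2 (assumption, not in the answer): collapse the leftover spaces.
const cleaned = stripped.replace(/\s+/g, " ").trim();

console.log(cleaned); // "This is a 1-2-a3 sample sentence"
```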

complicate regexp matching words (is it possible?)

I want a regex to match all Greek (utf-8) words that do NOT:
end with .
end with -
end with '
end with numbers (1-9)
start with .
start with ,
start with -
the first letter is capital
all letters are capital
Is this possible? To match Greek words I use \p{Greek}{3,} which matches Greek UTF-8 words that have at least 3 characters.
I write programs in ruby, but if it can be done in perl or any other cli tool/language I'll write a script to dump the output in a text file.
(?<!\S)(?=\S*\p{Greek})(?![-,.\p{Lu}])(?![\p{Lu}\P{L}]+\b)\S+(?<![-.'1-9])(?!\S)
Let's break this beasty down:
The core of the regex is the \S+ in the middle which is surrounded by a bunch of positive and negative assertions.
(?<!\S) - The word must not be preceded by a non-whitespace character. This makes sure we don't start our match in the middle of a word.
(?=\S*\p{Greek}) - There must be at least one Greek letter in there somewhere.
(?![-,.\p{Lu}]) - The word must not start with a dash, comma, dot, or uppercase letter \p{Lu}.
(?![\p{Lu}\P{L}]+\b) - The word must not be all uppercase letters and symbols.
(?<![-.'1-9]) - The word must not end with a dash, dot, apostrophe, or digit 1 through 9.
(?!\S) - The word must not be followed by a non-whitespace character. This makes sure we don't end our match in the middle of a word.
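For anyone trying this outside Ruby, here is a rough JavaScript adaptation, not the exact regex above. Two changes are assumptions on my part: JavaScript spells the script property \p{Script=Greek} (with the u flag), and its \b is ASCII-only, so the \b in the all-uppercase check is swapped for (?!\S). The sample words are made up and deliberately unaccented.

```javascript
// Adapted pattern (assumptions: \p{Script=Greek} instead of \p{Greek},
// and (?!\S) instead of \b, because \b in JavaScript only knows ASCII
// word characters even with the u flag).
const greekWord =
  /(?<!\S)(?=\S*\p{Script=Greek})(?![-,.\p{Lu}])(?![\p{Lu}\P{L}]+(?!\S))\S+(?<![-.'1-9])(?!\S)/gu;

// Made-up sample: plain word, capitalized, all caps, trailing dot,
// leading dash, trailing digit, non-Greek word, plain word.
const text = "και Και ΚΑΙ και. -και και5 hello που";

console.log(text.match(greekWord)); // [ 'και', 'που' ]
```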