Remove non-alphabetic words from a sentence using regex - regex

Is it possible to remove words in a sentence that don't contain any a-z letters? I've thought about negative lookarounds but wasn't successful.
For example,
This is a 1-2-a3 sample 12 -- 7-8 sentence
becomes
This is a 1-2-a3 sample sentence
Assume all other punctuation has been removed except dashes.
Thanks!

The regex below matches words that contain no letters.
(?<!\S)[^a-zA-Z\s]+(?!\S)
Just replace the matched words with an empty string to get your desired output. (?<!\S) is a negative lookbehind which asserts that the match is not preceded by a non-space character. (?!\S) is a negative lookahead which asserts that the match is not followed by a non-space character. Note that the spaces around each removed word stay in place, so collapse runs of spaces afterwards if you need single spacing.
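As a minimal Python sketch of the replacement described above (the variable names are illustrative, and the whitespace clean-up step is an addition, since removing a word leaves its surrounding spaces behind):

```python
import re

# Words made only of non-letter, non-space characters, delimited by whitespace
# (or the string boundaries) on both sides.
NON_ALPHA_WORD = re.compile(r"(?<!\S)[^a-zA-Z\s]+(?!\S)")

sentence = "This is a 1-2-a3 sample 12 -- 7-8 sentence"

# Step 1: delete the matched words (their surrounding spaces remain).
stripped = NON_ALPHA_WORD.sub("", sentence)

# Step 2 (an addition, not part of the regex): collapse the leftover spaces.
cleaned = " ".join(stripped.split())

print(cleaned)  # This is a 1-2-a3 sample sentence
```

The word 1-2-a3 survives because the character class stops at the letter a, and the lookahead then refuses a match that is followed by a non-space character.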

Related

Regex command to match combinations but not only uppercase letters

Is there a regex command to match all combinations of uppercase letters, lowercase, underscore, brackets, numbers, but not only Uppercase letter words or only numbers?
I thought I had it with this one:
(/\b(?![A-Z]+\b)(?![0-9]+\b)[a-zA-Z0-9_{}]+\b/)
That was until I encountered: ABC{hello}_HI_HelLo
This is not a match, and I would like my regex to match this string.
There seems to be a problem with the negative lookahead: it reads "ABC", treats it as an uppercase-only word, and rejects the match at that position, so only the part starting at "{" is matched.
When you add an underscore after "ABC" you get a matching string: ABC_{hello}_HI_HelLo
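\b also matches between a word character and a non-word character such as {, so in ABC{hello}_HI_HelLo the lookahead (?![A-Z]+\b) sees ABC followed by a word boundary and rejects the match. In ABC_{hello}_HI_HelLo the first word boundary is between _ and {, after a run that is not uppercase-only.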
You can assert a whitespace boundary to the left (?<!\S) and the right (?!\S) instead.
The pattern matches:
(?<!\S) Assert a whitespace boundary to the left
(?![A-Z]+(?!\S)) Assert not only uppercase chars followed by a whitespace boundary at the right
(?![0-9]+(?!\S)) Assert not only digits followed by a whitespace boundary at the right
[a-zA-Z0-9_{}]+ Match 1 or more occurrences of any of the listed
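Putting the listed parts together, a minimal Python sketch might look like this. The trailing (?!\S) is assumed from the "whitespace boundary to the right" mentioned above, and the sample text is made up for illustration:

```python
import re

# Parts taken from the breakdown above; the closing (?!\S) is an assumption.
PATTERN = re.compile(
    r"(?<!\S)"            # whitespace boundary to the left
    r"(?![A-Z]+(?!\S))"   # not an uppercase-only word
    r"(?![0-9]+(?!\S))"   # not a digits-only word
    r"[a-zA-Z0-9_{}]+"    # one or more of the allowed characters
    r"(?!\S)"             # whitespace boundary to the right (assumed)
)

sample = "ABC 123 ABC{hello}_HI_HelLo abc_1"  # hypothetical input
matches = PATTERN.findall(sample)
print(matches)  # ['ABC{hello}_HI_HelLo', 'abc_1']
```

Because the lookaheads test for whitespace rather than \b, the { after ABC no longer counts as the end of a word.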

How to match anything but exclusively small letters?

I am trying to come up with a regex that allows small letters alongside other characters, but not strings made up of small letters only.
e.g.
Example # would match
example # would not match
So a simple ^[A-Za-z0-9 ]+$ will not do the trick.
Here is an example of what I want to achieve, the last folder contains a city which is always in small letters, therefore a pattern I want to exclude:
https://regex101.com/r/gP1evZ/2
How can that be achieved in regex for python?
You could use an alternation here:
^(?:[^a-z]+|(?=[^a-z]).+)$
This regex says to match:
^(?: from the start of the string
[^a-z]+ one or more characters that are not lowercase letters
| OR
(?=[^a-z]) assert that the next character (the first one in the string) is not a lowercase letter
.+ then match one or more of any type of character
)$ end of the string
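A short Python check of the alternation above, as a sketch; the helper name is_match and the last two test strings are illustrative additions:

```python
import re

ALTERNATION = re.compile(r"^(?:[^a-z]+|(?=[^a-z]).+)$")

def is_match(text: str) -> bool:
    """Return True when the whole string satisfies the alternation."""
    return ALTERNATION.search(text) is not None

for text in ["Example #", "example #", "123 #", "eXample"]:
    print(repr(text), is_match(text))
```

Note that the lookahead is evaluated right after ^, so it only inspects the first character: a string such as eXample, which starts with a lowercase letter, is rejected even though it contains an uppercase one.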
If you want to allow spaces, but the string must not consist only of lowercase chars and spaces and must not be empty:
^(?![a-z ]+$)[A-Za-z0-9 ]*[A-Za-z0-9][A-Za-z0-9 ]*$
Or without the lookahead, require at least one uppercase char or digit:
^[A-Za-z0-9 ]*[A-Z0-9][A-Za-z0-9 ]*$
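To compare the two variants side by side, here is a small Python sketch; the sample strings and the names WITH_LOOKAHEAD and WITHOUT_LOOKAHEAD are made up for illustration:

```python
import re

WITH_LOOKAHEAD = re.compile(r"^(?![a-z ]+$)[A-Za-z0-9 ]*[A-Za-z0-9][A-Za-z0-9 ]*$")
WITHOUT_LOOKAHEAD = re.compile(r"^[A-Za-z0-9 ]*[A-Z0-9][A-Za-z0-9 ]*$")

samples = ["Example 1", "example 1", "example city", "   ", "", "Example #"]

results = {
    text: (bool(WITH_LOOKAHEAD.search(text)), bool(WITHOUT_LOOKAHEAD.search(text)))
    for text in samples
}

for text, (first, second) in results.items():
    print(repr(text), first, second)
```

Both patterns agree on these inputs. "Example #" is rejected by both because # is not in the allowed character class, so the class would need extending if such characters are expected.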
Edit
For the updated data, you could use a negative lookahead (?!.*/[a-z]+/) to assert that nothing to the right is a folder consisting only of lowercase chars between forward slashes.
^/(hunde|kleinanzeigen)/(?!.*/[a-z]+/).*(prp_[a-z0-9_]+_\d+|cat_48_5030.*)\.html$
Or a bit broader match:
^/(hunde|kleinanzeigen)/(?!.*/[a-z]+/)\S+\.html$
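A rough Python sketch of the broader pattern follows. The paths are hypothetical, since the real data lives behind the regex101 link, so treat the outcome as an illustration of the lookahead rather than a check against the actual URLs:

```python
import re

BROAD = re.compile(r"^/(hunde|kleinanzeigen)/(?!.*/[a-z]+/)\S+\.html$")

# Hypothetical paths, invented for this example.
paths = {
    "/hunde/s-123/prp_x_1.html": None,         # no lowercase-only folder
    "/hunde/s-123/berlin/prp_x_1.html": None,  # "berlin" folder between slashes
    "/katzen/s-123/prp_x_1.html": None,        # prefix not in the alternation
}

for path in paths:
    paths[path] = BROAD.search(path) is not None
    print(path, paths[path])
```

The lookahead needs a slash on both sides of the lowercase run, so it only catches a lowercase-only folder that comes after at least one other path segment following the prefix.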
Try
^(?![a-z\s]*$)
This should match strings that do not consist only of lowercase characters and whitespace. Remove \s if necessary.

Unmatch complete words if a negative lookahead is satisfied

I need to match only those words which don't have special characters like # and :.
For example:
git#github.com shouldn't match
list should return a valid match
show should also return a valid match
I tried it using a negative lookahead \w+(?![#:])
But it matches gi out of git#github.com, and it shouldn't match that either.
You may add \w to the lookahead:
\w+(?![\w#:])
The equivalent is using a word boundary:
\w+\b(?![#:])
Besides, you may consider adding a left-hand boundary to avoid matching words inside non-word non-whitespace chunks of text:
^\w+(?![\w#:])
Or
(?<!\S)\w+(?![\w#:])
The ^ will match the word at the start of the string and (?<!\S) will match only if the word is preceded with whitespace or start of string.
Why not (?<!\S)\w+(?!\S), the whitespace boundaries? Since you are building a lexer, you most probably have to deal with natural language sentences where words are likely to be followed with punctuation, and the (?!\S) negative lookahead would make the \w+ match only when it is followed with whitespace or at the end of the string.
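The differences between these variants are easier to see when run. A small Python sketch, with sample strings made up for illustration:

```python
import re

text = "git#github.com list show"

# Original attempt: backtracks to "gi" because only the next char is checked.
naive = re.findall(r"\w+(?![#:])", text)

# Adding \w to the lookahead stops the engine from matching part of a word.
no_partial = re.findall(r"\w+(?![\w#:])", text)

# A left-hand whitespace boundary also drops "github" and "com".
left_bounded = re.findall(r"(?<!\S)\w+(?![\w#:])", text)

# Trailing punctuation: the (?!\S) variant finds nothing here.
punctuated = "list, show."
tolerant = re.findall(r"(?<!\S)\w+(?![\w#:])", punctuated)
strict = re.findall(r"(?<!\S)\w+(?!\S)", punctuated)

print(naive)         # ['gi', 'github', 'com', 'list', 'show']
print(no_partial)    # ['github', 'com', 'list', 'show']
print(left_bounded)  # ['list', 'show']
print(tolerant)      # ['list', 'show']
print(strict)        # []
```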
You can use negative lookbehind and negative lookahead patterns around a word pattern to make sure that the word is not preceded or followed by a non-space character, or in other words, to make sure that it is surrounded by either a space or a string boundary:
(?<!\S)\w+(?!\S)
Demo: https://regex101.com/r/cjhUUM/2

How to match a specific word without spaces and without an additional letter in the starting or ending?

Let's say I have the word phone
Its possible matches in my case are as follows
phone (no space in the beginning and in the ending just phone)
"phone" (can have special characters at the end or in the beginning)
Cases to be Neglected [Here I'll mark the space with \s]
phone\s (any space in either in the beginning or in the end should not be matched)
phoneno (any alphabets or numbers appended with phone should not be matched)
I've tried the following regex [^\w\s]items[^\w\s]
But it didn't match the case of phone with no space at the beginning and the end, as it requires one character other than a space or an alphanumeric at the beginning and the end
Kindly suggest any other solutions which satisfy the above-mentioned cases
You may use custom word boundaries, a combination of \b and (?<!\S) / (?!\S):
(?<![\w\s])phone(?![\w\s])
The (?<![\w\s]) negative lookbehind pattern matches a location in string that is NOT immediately preceded with a word or whitespace char.
The (?![\w\s]) negative lookahead pattern matches a location in string that is NOT immediately followed with a word or whitespace char.
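A quick Python sketch of these custom boundaries, run against the cases from the question plus a few extra strings added for illustration (the helper name has_phone is hypothetical):

```python
import re

PHONE = re.compile(r"(?<![\w\s])phone(?![\w\s])")

def has_phone(text: str) -> bool:
    """True if 'phone' occurs with no word or whitespace char on either side."""
    return PHONE.search(text) is not None

cases = ["phone", '"phone"', "(phone)", "phone ", " phone", "phoneno", "my phone"]
for case in cases:
    print(repr(case), has_phone(case))
```

The bare word matches because a negative lookaround also succeeds at the start or end of the string, where there is no neighbouring character at all.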

How do I write a regex for words having alphanumeric characters but not made of only numbers?

For a line input "Abcd abcd1a 5ever qw3-fne superb5 1234 0"
I am trying to match words having letters and numbers, like "Abcd","abcd1a","5ever", "superb5","qw3","fne". But it should not match words having only numbers, like "1234", "0".
Words are separated by all the characters other than above alphanumerics.
I tried this regex (?![0-9])([A-Za-z0-9]+) which fails to match the word "5ever" but works properly for everything else.
How do I write this regex so that it also matches the word "5ever" in full?
Option 1 - Negative lookahead
\b(?!\d+\b)[^\W_]+
\b(?!\d+\b)[A-Za-z\d]+
\b(?!\d+\b)[a-z\d]+ # With case-insensitive flag enabled
\b Assert position as a word boundary
(?!\d+\b) Negative lookahead ensuring the whole word isn't made up of only digits
[^\W_]+ or [A-Za-z\d]+ Matches only letters or digits one or more times
Option 2 - Without lookahead
Another alternative (with the case-insensitive i flag enabled for the first pattern):
\b\d*[a-z][a-z\d]* # With case-insensitive flag enabled
\b\d*[A-Za-z][A-Za-z\d]*
\b Assert position as a word boundary
\d* Match any digit any number of times
[a-z] Match any letter (with i flag enabled this also matches A-Z)
[a-z\d]* Match any letter or digit any number of times
Matches the following from the string Abcd abcd1a 5ever qw3-fne superb5 1234 0:
Abcd
abcd1a
5ever
qw3
fne
superb5
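Both options can be checked in Python against the input from the question. This is a sketch; the variable names are illustrative, and the first line reproduces the original attempt for comparison:

```python
import re

line = "Abcd abcd1a 5ever qw3-fne superb5 1234 0"

# Original attempt: the lookahead only forbids a leading digit, so "5ever"
# is picked up from its second character onwards.
attempt = re.findall(r"(?![0-9])([A-Za-z0-9]+)", line)

# Option 1: reject a word only if it is digits all the way to the boundary.
option1 = re.findall(r"\b(?!\d+\b)[^\W_]+", line)

# Option 2: optional digits, then a mandatory letter, then letters or digits.
option2 = re.findall(r"\b\d*[a-z][a-z\d]*", line, flags=re.IGNORECASE)

print(attempt)  # ['Abcd', 'abcd1a', 'ever', 'qw3', 'fne', 'superb5']
print(option1)  # ['Abcd', 'abcd1a', '5ever', 'qw3', 'fne', 'superb5']
print(option2)  # ['Abcd', 'abcd1a', '5ever', 'qw3', 'fne', 'superb5']
```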
I came up with the following regex:
/\d*[a-z_]+\w*/ig
\d* starts with possible digit(s)
[a-z_]+ contains letter or underscore in qty one and more
\w* possibly followed by any word characters (letters, digits, underscore) after that
ig case insensitive and global flags
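A possible Python translation of this pattern, as a sketch: re.findall plays the role of the g flag and re.IGNORECASE the role of i. The second input is made up to show that, unlike the question's definition of a word, this pattern also accepts underscores:

```python
import re

PATTERN = re.compile(r"\d*[a-z_]+\w*", re.IGNORECASE)

line = "Abcd abcd1a 5ever qw3-fne superb5 1234 0"
words = PATTERN.findall(line)
print(words)  # ['Abcd', 'abcd1a', '5ever', 'qw3', 'fne', 'superb5']

# Hypothetical input: underscores count as "letters" for this pattern.
with_underscores = PATTERN.findall("12_3 __ 1234")
print(with_underscores)  # ['12_3', '__']
```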