I'm new to RegEx and I'm looking for a way to match sentences where the first letter is capitalized and the rest is in lowercase.
I've tried a couple of things (IF statements included), but just can't seem to get it.
This is my last version:
(([A-Z])([a-z]+\s|[a-z]+))+
I thought it worked at first, but it is now accepting capital letters in the middle of a word.
The Output Would Be Like This (Each Word Capitalized).
Thanks!!
The expression accepts capital letters in the middle of a word because the whitespace after each word is optional, so words can run into each other.
You can take a more structured approach: a sentence must have at least one word. That's
[A-Z][a-z]*
After that initial word you can get any number of more words, each preceded by whitespace. So in total:
[A-Z][a-z]*(\s[A-Z][a-z]*)*
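A minimal Python sketch of the pattern above, using re.fullmatch so that the whole string has to fit. The name TITLE_CASE, the helper is_title_case, and the sample strings are only illustrations, not part of the question:

```python
import re

# One capitalized word, then any number of capitalized words,
# each preceded by exactly one whitespace character.
TITLE_CASE = re.compile(r"[A-Z][a-z]*(\s[A-Z][a-z]*)*")

def is_title_case(text):
    # fullmatch succeeds only if the pattern covers the entire string.
    return TITLE_CASE.fullmatch(text) is not None

print(is_title_case("The Output Would Be Like This"))  # True
print(is_title_case("The OutPut Would Be Like This"))  # False: capital inside a word
print(is_title_case("the output"))                     # False: first word not capitalized
```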
To match whole strings that start with an uppercase letter and then have no uppercase letters use
^[A-Z][^A-Z]*$
See the regex demo. ^ matches the start of string, [A-Z] matches the uppercase letters, [^A-Z]* matches 0 or more chars other than uppercase letters and $ matches the end of string.
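As a quick check in Python (the helper name is_sentence_case and the sample strings are made up for illustration), the anchored pattern could be used like this:

```python
import re

# Uppercase first character, then no further uppercase letters until the end.
SENTENCE_CASE = re.compile(r"^[A-Z][^A-Z]*$")

def is_sentence_case(text):
    return SENTENCE_CASE.match(text) is not None

print(is_sentence_case("This is a sentence."))  # True
print(is_sentence_case("This is A sentence."))  # False: second uppercase letter
print(is_sentence_case("this is a sentence."))  # False: lowercase first letter
```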
To match capitalized words, you may use
\b[A-Z][a-zA-Z]*\b
where \b stands for word boundaries. See the regex demo.
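A small Python sketch of the word-boundary version; the helper name capitalized_words and the sample sentence are illustrative assumptions:

```python
import re

# A word that starts with an uppercase ASCII letter, followed by any ASCII letters.
CAPITALIZED_WORD = re.compile(r"\b[A-Z][a-zA-Z]*\b")

def capitalized_words(text):
    return CAPITALIZED_WORD.findall(text)

# "iPhone" is skipped: there is no word boundary right before its "P".
print(capitalized_words("The quick Brown fox met McGregor and an iPhone"))
# ['The', 'Brown', 'McGregor']
```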
In various regex flavors, there are other ways to match word boundaries:
bash, r (TRE, base R): \<[A-Z][a-zA-Z]*\>
postgresql, tcl: \m[A-Z][a-zA-Z]*\M or \y[A-Z][a-zA-Z]*\y
bash, mysql (MySQL versions before 8): [[:<:]][A-Z][a-zA-Z]*[[:>:]]
Also, you may consider using [[:upper:]] or \p{Lu} instead of [A-Z], and [[:alpha:]] or \p{L} instead of [a-zA-Z], to match any Unicode uppercase letters or any letters, respectively.
See this demo and this demo, too.
Related
I am trying to create a unicode regex that matches every character except for a letter (of any language) and the punctuation signs .;:?!.
So for example the string
abcd 123 kjd ¤%/(" .?:!
should only match the bold parts below
abcd 123 kjd ¤%/(" .?:!
I know that \P{L}+ matches everything except a letter and \P{P}+ matches everything except a punctuation sign. How do I combine these two regex strings into one? I have tried simply putting them together as \P{L}+\P{P}+, but this does not give the required match. I have also tried writing [^.;:?!]\P{L}+, but this does not work either.
How do I combine one or more unicode regex or is there a better regex that achieves my requirement?
Using \P{L}+\P{P}+ will match 1+ times the opposite of any letter followed by 1+ times the opposite of any punctuation mark.
The pattern [^.;:?!]\P{L}+ matches 1 time any character other than the listed followed by 1+ times the opposite of any letter.
What you could do is add \p{L} (which matches any kind of letter) to the negated character class, so that letters are excluded together with the listed punctuation. As advised by Wiktor Stribiżew, you can also add \p{Z} to exclude any kind of whitespace or separator character.
[^\p{Z}\p{L}.;:?!]
Regex demo
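Python's built-in re module does not understand \p{...} classes (the third-party regex package does, as far as I know), so the following is only a sketch that emulates the character class [^\p{Z}\p{L}.;:?!] with unicodedata instead of running the pattern itself. The names is_match_char and matching_runs are hypothetical:

```python
import unicodedata
from itertools import groupby

# Punctuation signs that must not be part of a match.
EXCLUDED_PUNCTUATION = set(".;:?!")

def is_match_char(ch):
    # Mirrors the negated class: not a separator (Z*), not a letter (L*),
    # and not one of the listed punctuation signs.
    category = unicodedata.category(ch)
    if category.startswith("L") or category.startswith("Z"):
        return False
    return ch not in EXCLUDED_PUNCTUATION

def matching_runs(text):
    # Group consecutive accepted characters, like applying + to the class.
    return ["".join(run) for keep, run in groupby(text, key=is_match_char) if keep]

print(matching_runs('abcd 123 kjd ¤%/(" .?:!'))  # ['123', '¤%/("']
```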
I am trying to come up with a regex that will allow small letters alongside with other characters but not if there are only small letters.
e.g.
Example # would match
example # would not match
So a simple ^[A-Za-z0-9 ]+$ will not do the trick.
Here is an example of what I want to achieve, the last folder contains a city which is always in small letters, therefore a pattern I want to exclude:
https://regex101.com/r/gP1evZ/2
How can that be achieved in regex for python?
You could use an alternation here:
^(?:[^a-z]+|(?=[^a-z]).+)$
Demo
This regex says to match:
^(?: from the start of the string
[^a-z]+ all non lowercase letters
| OR
(?=[^a-z]) assert that the next character (here, the first character of the string) is not a lowercase letter
.+ then match one or more of any type of character
)$ end of the string
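To see how the alternation behaves, here is a short Python check. The sample strings are my own; note that the lookahead is evaluated at the start of the string, so it inspects the first character only:

```python
import re

PATTERN = re.compile(r"^(?:[^a-z]+|(?=[^a-z]).+)$")

def matches(text):
    return PATTERN.match(text) is not None

print(matches("Example #"))   # True: first character is not a lowercase letter
print(matches("example #"))   # False: starts with a lowercase letter
print(matches("EXAMPLE 42"))  # True: no lowercase letters at all
print(matches("eXAMPLE"))     # False: the lookahead fails on the leading "e"
```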
If you want to allow spaces, but the string should neither be empty nor consist only of lowercase chars and spaces:
^(?![a-z ]+$)[A-Za-z0-9 ]*[A-Za-z0-9][A-Za-z0-9 ]*$
Regex demo
Or without the lookahead, match at least an uppercase char or digit
^[A-Za-z0-9 ]*[A-Z0-9][A-Za-z0-9 ]*$
Regex demo
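Both variants can be compared side by side in Python. This is just a sketch with made-up sample strings; check is a hypothetical helper that returns the result of each pattern:

```python
import re

WITH_LOOKAHEAD = re.compile(r"^(?![a-z ]+$)[A-Za-z0-9 ]*[A-Za-z0-9][A-Za-z0-9 ]*$")
WITHOUT_LOOKAHEAD = re.compile(r"^[A-Za-z0-9 ]*[A-Z0-9][A-Za-z0-9 ]*$")

def check(text):
    # (result of the lookahead version, result of the version without it)
    return (bool(WITH_LOOKAHEAD.match(text)), bool(WITHOUT_LOOKAHEAD.match(text)))

print(check("Example 1"))  # (True, True)
print(check("example 1"))  # (True, True): the digit is enough
print(check("example"))    # (False, False): only lowercase letters
print(check("Example #"))  # (False, False): "#" is not an allowed character
print(check(""))           # (False, False): empty string
```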
Edit
For the updated data, you could use a negative lookahead (?!.*/[a-z]+/) to assert what is on the right is not only lowercase chars between forward slashes.
^/(hunde|kleinanzeigen)/(?!.*/[a-z]+/).*(prp_[a-z0-9_]+_\d+|cat_48_5030.*)\.html$
Regex demo
Or a bit broader match:
^/(hunde|kleinanzeigen)/(?!.*/[a-z]+/)\S+\.html$
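A rough Python check of both path patterns. The paths below are invented for illustration and are not taken from the linked data; is_wanted and is_wanted_strict are hypothetical helper names:

```python
import re

BROAD = re.compile(r"^/(hunde|kleinanzeigen)/(?!.*/[a-z]+/)\S+\.html$")
STRICT = re.compile(
    r"^/(hunde|kleinanzeigen)/(?!.*/[a-z]+/).*(prp_[a-z0-9_]+_\d+|cat_48_5030.*)\.html$"
)

def is_wanted(path):
    return BROAD.match(path) is not None

def is_wanted_strict(path):
    return STRICT.match(path) is not None

# Made-up paths: the second one has an all-lowercase folder between slashes.
print(is_wanted("/hunde/bayern/prp_labrador_123.html"))           # True
print(is_wanted("/hunde/bayern/muenchen/prp_labrador_123.html"))  # False
print(is_wanted_strict("/hunde/bayern/prp_labrador_123.html"))    # True
print(is_wanted_strict("/hunde/bayern/other_page.html"))          # False
```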
Try
^(?![a-z\s]*$)
This should match strings that do not consist solely of lowercase characters and whitespace. Remove \s if necessary.
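In Python this lookahead-only pattern yields a zero-width match, so only the presence of a match matters. A minimal sketch, with the helper name has_other_chars as an assumption:

```python
import re

# Zero-width check at the start: fail if the whole string is lowercase/whitespace.
NOT_ONLY_LOWER = re.compile(r"^(?![a-z\s]*$)")

def has_other_chars(text):
    # The match object is empty (zero-width), so test for None explicitly.
    return NOT_ONLY_LOWER.match(text) is not None

print(has_other_chars("Example"))     # True
print(has_other_chars("example"))     # False
print(has_other_chars("two words"))   # False
print(has_other_chars("two words!"))  # True
```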
I need to extract from a text all the words which match these two requirements:
Contain at least one uppercase letter
Don't fully consist of uppercase characters.
So, Word and WorD are correct captures, but word and WORD aren't.
So, I can capture all the words using a \b([a-zA-Z]+)\b Regex, but I don't know how to add the uppercase letters condition here.
As about the requirement #1, I tried to use a positive lookahead here like this:
\b(?=.*[A-Z]+)([a-zA-Z]+)\b , but now it captures all the words from a line if this line has at least one uppercase letter.
Is it even possible to apply additional conditions to a capturing group?
I can process this in my application's code but I'd really prefer to fit all those requirements in a single Regex.
You may use
\b(?=[A-Z]*[a-z])(?=[a-z]*[A-Z])([a-zA-Z]+)\b
See the regex demo
Actually, you do not even need the capturing group: ([a-zA-Z]+) can usually be replaced with [a-zA-Z]+, but it depends on where you are using the regex.
Details
\b - word boundary
(?=[A-Z]*[a-z]) - a positive lookahead that requires a lowercase letter after 0+ uppercase ones
(?=[a-z]*[A-Z]) - a positive lookahead that requires an uppercase letter after 0+ lowercase ones
([a-zA-Z]+) - Group 1: 1 or more letters
\b - a word boundary.
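A short Python sketch of the two-lookahead pattern, using the words from the question plus one extra sample of my own (camelCase); mixed_case_words is a hypothetical helper name:

```python
import re

# A word needs at least one lowercase and at least one uppercase letter.
MIXED_CASE = re.compile(r"\b(?=[A-Z]*[a-z])(?=[a-z]*[A-Z])([a-zA-Z]+)\b")

def mixed_case_words(text):
    # findall returns the contents of Group 1 for every match.
    return MIXED_CASE.findall(text)

print(mixed_case_words("Word WorD word WORD camelCase"))
# ['Word', 'WorD', 'camelCase']
```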
I can use \s?(\w+\s){0,2}\w*) for "up to three words" and \w{0,20} for "no more than twenty characters", but how can I combine these? Trying to merge the two via a lookahead as mentioned here seems to fail.
Some examples for clarification:
The early bird catches the worm.
should match any three words in sequence (including the worm*).
Here we have a supercalifragilisticexpialidocious sentence.
"a supercalifragilisticexpialidocious sentence" is too long a sequence and therefore should not match.
* In my actual use case I'm going for a paragraph's last three words, i.e. a (?:\r) would be at the end of the RegEx and the match would be "catches the worm.". Matches are then given a "no linebreaks" character style in Adobe InDesign in order to avoid orphans.
To match 3 words separated with whitespace(s) at the end of a line or string, you can use
\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])
See the regex demo. Note that in the demo, I use [^\S\r\n] instead of the \s in the lookahead since the text contains newlines, use the same trick if you need that.
Regex explanation
\b - a word boundary
(?!(?:\s*\w){21}) - a lookahead check that fails the match if after the initial word boundary there are 21 word characters optionally preceded with any number of whitespace symbols
\w+ - 1 word (consisting of 1 or more word characters)
(?:\s+\w+){0,2} - zero, one or two sequences of 1+ whitespaces followed with 1+ word characters
(?=$|[\r\n]) - a positive lookahead that only allows a match to be returned if there is the end-of-string ($) or the end of a line ([\r\n]).
Now, if your words should only contain letters, use [a-zA-Z] or equivalent for your language. If the regex flavor allows, use \p{L} Unicode category/property class.
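The pattern can be tried out in Python as follows. This is only a sketch: last_words is a hypothetical helper, and the sample lines omit the final period because the pattern as written expects a word character right before the end of the line:

```python
import re

# Up to three trailing words whose word characters total at most 20.
LAST_WORDS = re.compile(r"\b(?!(?:\s*\w){21})\w+(?:\s+\w+){0,2}(?=$|[\r\n])")

def last_words(line):
    match = LAST_WORDS.search(line)
    return match.group(0) if match else None

print(last_words("The early bird catches the worm"))
# catches the worm
print(last_words("Here we have a supercalifragilisticexpialidocious sentence"))
# sentence
```

In the second line the long word pushes every longer candidate over the 20-character limit, so only the last word is returned.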
In vim regex syntax, I am trying to match all words that start with an uppercase letter and do not start with an underscore
\\([A-Z][a-z_][A-Za-z_]\\+\\)
This is what I have until now.
I want something like this:
\\([A-Z^\_][a-z_][A-Za-z_]\\+\\)
Where [A-Z^\\_] denotes that it should match with all uppercase chars, but not underscore.
Any help would be greatly appreciated. Thanks in advance.
Edit: My question was worded poorly. I want the first set to match an uppercase char which does not have an underscore in front of it. Sorry.
[A-Z] already does not include underscores; I guess you want to match whole words, so you don't want your regular expression to match inside a word. Vim has built-in \< and \> (like \b in other regular expression dialects, see @npinti's answer) for keyword boundaries; as lower/uppercase and underscore characters are usually keyword characters, wrapping your pattern with those should already be close enough:
\<\([A-Z][a-z_][A-Za-z_]\+\)\>
To strictly assert no underscore before your match (but allow any other keyword or non-keyword characters there), you'd need a negative lookbehind: \@<! means is not preceded by:
_\@<!\([A-Z][a-z_][A-Za-z_]\+\)
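For comparison, here are the same two ideas translated into Python's re syntax. This is an illustration only, not Vim syntax: \b stands in for Vim's keyword boundaries and (?<!_) for the negative lookbehind. The sample text is made up:

```python
import re

# Whole words only: a boundary on both sides (similar to Vim's keyword boundaries).
WHOLE_WORD = re.compile(r"\b([A-Z][a-z_][A-Za-z_]+)\b")
# Only forbid an underscore directly before the uppercase letter.
NO_UNDERSCORE_BEFORE = re.compile(r"(?<!_)([A-Z][a-z_][A-Za-z_]+)")

text = "Hello _World fooBar"

print(WHOLE_WORD.findall(text))            # ['Hello']
print(NO_UNDERSCORE_BEFORE.findall(text))  # ['Hello', 'Bar']
```

The lookbehind version still matches Bar inside fooBar, because only an underscore is ruled out in front of the match.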
Where [A-Z^\_] denotes that it should match with all uppercase chars, but not underscore.
[A-Z] already matches with all uppercase chars excluding underscore. However in your first solution, you request that the second letter be lowercase or underscore ([a-z_]). If I stick to your definition:
all words with starting uppercase, and not starting underscore
Then [A-Z][A-Za-z_]\+ should work (in Vim's default "magic" mode the + quantifier has to be written as \+).