Regex to convert words in TitleCase - regex

I use this regex to convert words in TitleCase and confirm each substitution:
:s/\%V\<\([A-Za-z0-9àäâæèéëêìòöôœùüûçÀÄÂÆßÈÉËÊÌÖÔŒÙÜÛ]\)\([A-Za-z0-9àäâæèéëêìòöôœùüûçÀÄÂÆßÈÉËÊÌÖÔŒÙÜÛ]*\)\>/\u\1\L\2/gc
However this matches also the words who are already in Titlecase.
Does anyone know how to change the above regex in order to jump over words who are already in TitleCase?

:s/\%V\<\([a-z0-9àäâæèéëêìòöôœùüûç]\)\([A-Za-z0-9àäâæèéëêìòöôœùüûçÀÄÂÆßÈÉËÊÌÖÔŒÙÜÛ]*\)\>/\u\1\L\2/gc
seems to do the trick, here.
Because you have explicitely included uppercase characters in the range you use in the first letter capture group, your pattern is going to match both foo and Foo. Removing the uppercase characters from that range seems to resolve your immediate problem.

To match only non-titlecase words, you want to match those that start either (a) with a lowercase letter or (b) with two uppercase letters. The following will do it (add accented letters and digits to taste):
\b([A-Z])([A-Z][A-Za-z]*)|\b([a-z])([a-zA-Z]+)
But some words match at groups \1 and \2, others at \3 and \4. I don't use vim so I can't say if it'll let you substitute with this kind of pattern. (E.g., \u\1\3\L\2\4; only two of the four will ever be non-empty)

Related

Capturing uppercase words in text with regex

I'm trying to find words that are in uppercase in a given piece of text. The words must be one after the other to be considered and they must be at least 4 of them.
I have a "almost" working code but it captures much more: [A-Z]*(?: +[A-Z]*){4,}. The capture group also includes spaces at the start or the end of those words (like a boundary).
I have a playground if you want to test it out: https://regex101.com/r/BmXHFP/2
Is there a way to make the regex in example capture only the words in the first sentence? The language I'm using is Go and it has no look-behind/ahead.
In your regex, you just need to change the second * for a +:
[A-Z]*(?: +[A-Z]+){4,}
Explanation
While using (?: +[A-Z]*), you are matchin "a space followed by 0+ letters". So you are matching spaces. When replacing the * by a +, you matches spaces if there are uppercase after.
Demo on regex101
Replace the *s by +s, and your regex only matches the words in the first sentence.
.* also matches the empty string. Looking at you regex and ignoring both [A-Z]*, all that remains is a sequence of spaces. Using + makes sure that there is at least one uppercase char between every now and then.
You had to mark at least 1 upper case as [A-Z]*(?: +[A-Z]+){4,} see updated regex.
A better Regex will allow non spaces as [A-Z]*(?: *[A-Z]+){4,}.see better regex
* After will indicate to allow at least upper case even without spaces.

Find and Replace with Space Inserted

I've got strings like this in a text file:
10.Divide using the divider at 12C. and pressure at 3.0.
11.Form into cylinders and put on boards, don't handle too much.
This Regex (\d+\.)[A-Z] correctly finds a numeric value, followed by a period, followed by a capital letter.
I want to insert a space between the period and the capital letter. How do I do this?
Actually your regex is wrong:
(\d+.)[A-Z] matches 1-or-more occurennce of digits, follow by ANY CHARACTER. . in regex means any character. The more correct one should be \d+\.[A-Z] (Omitted the group too as it is not required for matching. Note that the . is escaped).
In order to insert space, apart from the solution provided by another answer by using 2 groups: i.e. Find (\d+\.)([A-Z]) (note the dot fixed) and replace with \1 \2, you may also consider using lookaround feature:
Find (?<=\d\.)(?=[A-Z]) and Replace with (a single space). This regex find a spot that is preceded with a digit and then a dot, and is followed by a capital letter. Then we are replacing that spot with a space. (Note that lookahead and lookbehind group is not included in the "matched" result)
You're mostly there. When you wrap a regex subexpression in parentheses, you can refer to it in the "Replace" field of a find and replace operation. So...
Find what: (\d+.)([A-Z])
Replace with: \1 \2
(See similar questions like this one.)

Cleaning up a regular expression which has lots of repetition

I am looking to clean up a regular expression which matches 2 or more characters at a time in a sequence. I have made one which works, but I was looking for something shorter, if possible.
Currently, it looks like this for every character that I want to search for:
([A]{2,}|[B]{2,}|[C]{2,}|[D]{2,}|[E]{2,}|...)*
Example input:
AABBBBBBCCCCAAAAAADD
See this question, which I think was asking the same thing you are asking. You want to write a regex that will match 2 or more of the same character. Let's say the characters you are looking for are just capital letters, [A-Z]. You can do this by matching one character in that set and grouping it by putting it in parentheses, then matching that group using the reference \1 and saying you want two or more of that "group" (which is really just the one character that it matched).
([A-Z])\1{1,}
The reason it's {1,} and not {2,} is that the first character was already matched by the set [A-Z].
Not sure I understand your needs but, how about:
[A-E]{2,}
This is the same as yours but shorter.
But if you want multiple occurrences of each letter:
(?:([A-Z])\1+)+
where ([A-Z]) matches one capital letter and store it in group 1
\1 is a backreference that repeats group 1
+ assume that are one or more repetition
Finally it matches strings like the one you've given: AABBBBBBCCCCAAAAAADD
To be sure there're no other characters in the string, you have to anchor the regex:
^(?:([A-Z])\1+)+$
And, if you wnat to match case insensitive:
^(?i)(?:([A-Z])\1+)+$

Regex for this particular pattern

I have three different things
xxx
xxx>xxx
xxx>xxx>xxx
Where xxx can be any combination of letters and number
I need a regex that can match the first two but NOT the third.
To match ASCII letters and digits try the following:
^[a-zA-Z0-9]{3}(>[a-zA-Z0-9]{3})?$
If letters and digits outside of the ASCII character set are required then the following should suffice:
^[^\W_]{3}(>[^\W_]{3})?$
^\w+(?:>\w+)?$
matches an entire string.
\w+(?:>\w+)?\b(?!>)
matches strings like this in a larger substring.
If you want to exclude the underscore from matching, you can use [\p{L]\p{N}] instead (if your regex engine knows Unicode), or [^\W_] if it doesn't, as a substitute for \w.

Regex to validate number and letter sequence

I want a regex to validate inputs of the form AABBAAA, where A is a a letter (a-z, A-Z) and B is a digit (0-9). All the As must be the same, and so must the Bs.
If all the A's and B's are supposed to be the same, I think the only way to do it would be:
([a-zA-Z])\1([0-9])\2\1\1\1
Where \1 and \2 refer to the first and second parenthetical groupings. However, I don't think all regex engines support this.
It's really not as hard as you think; you've got most of the syntax already.
[a-zA-Z]{2}[0-9]{2}[a-zA-Z]{3}
The numbers in braces ({}) tell how many times to match the previous character or set of characters, so that matches [a-zA-Z] twice, [0-9] twice, and [a-zA-Z] three times.
Edit: If you want to make sure the matched string is not part of a longer string, you can use word boundaries; just add \b to each end of the regex:
\b[a-zA-Z]{2}[0-9]{2}[a-zA-Z]{3}\b
Now "Ab12Cde" will match but "YZAb12Cdefg" will not.
Edit 2: Now that the question has changed, backreferences are the only way to do it. edsmilde's answer should work; however, you may need to add the word boundaries to get your final solution.
\b([a-zA-Z])\1([0-9])\2\1\1\1\b
[a-zA-Z]{2}\d{2}[a-zA-Z]{3}