Cleaning up a regular expression which has lots of repetition - regex

I am looking to clean up a regular expression that matches runs of two or more of the same character in a sequence. I have one that works, but I would like something shorter, if possible.
Currently, it looks like this for every character that I want to search for:
([A]{2,}|[B]{2,}|[C]{2,}|[D]{2,}|[E]{2,}|...)*
Example input:
AABBBBBBCCCCAAAAAADD

See this question, which I think was asking the same thing you are asking. You want to write a regex that will match 2 or more of the same character. Let's say the characters you are looking for are just capital letters, [A-Z]. You can do this by matching one character in that set and grouping it by putting it in parentheses, then matching that group using the reference \1 and saying you want two or more of that "group" (which is really just the one character that it matched).
([A-Z])\1{1,}
The reason it's {1,} and not {2,} is that the first character was already matched by the set [A-Z].
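As a minimal sketch of the backreference approach, here is the pattern applied to the example input from the question using Python's built-in re module. The helper name find_runs is made up for illustration.

```python
import re

# One captured capital letter, then the same letter at least once more.
RUN = re.compile(r"([A-Z])\1{1,}")

def find_runs(text):
    """Return every run of two or more identical capital letters (hypothetical helper)."""
    return [m.group(0) for m in RUN.finditer(text)]

print(find_runs("AABBBBBBCCCCAAAAAADD"))
# ['AA', 'BBBBBB', 'CCCC', 'AAAAAA', 'DD']
```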

I'm not sure I understand your needs, but how about:
[A-E]{2,}
This is shorter than yours, but note that it is not equivalent: it also matches mixed runs such as AB or ACE.
But if you want multiple occurrences of each letter:
(?:([A-Z])\1+)+
where ([A-Z]) matches one capital letter and stores it in group 1
\1 is a backreference that matches the same text as group 1
+ means one or more repetitions
Finally it matches strings like the one you've given: AABBBBBBCCCCAAAAAADD
To be sure there're no other characters in the string, you have to anchor the regex:
^(?:([A-Z])\1+)+$
And, if you want to match case-insensitively:
^(?i)(?:([A-Z])\1+)+$
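A rough Python sketch of the anchored check follows. It assumes Python's re module, where re.fullmatch plays the role of ^ and $, and where the case-insensitive flag is passed as an argument because recent Python versions reject an inline (?i) that is not at the very start of the pattern. The function name is_all_runs is an illustrative choice.

```python
import re

# Same pattern as above, without the anchors: fullmatch requires the
# whole string to match, which is what ^ and $ provide.
PATTERN = r"(?:([A-Z])\1+)+"

def is_all_runs(text, ignore_case=False):
    """True if text consists only of runs of 2+ identical letters (hypothetical helper)."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(PATTERN, text, flags) is not None

print(is_all_runs("AABBBBBBCCCCAAAAAADD"))    # True
print(is_all_runs("AAB"))                     # False: the final B is not repeated
print(is_all_runs("aabb", ignore_case=True))  # True
```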

Related

Is there a way to use periodicity in a regular expression?

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, "That was a good movie." should result in "That was", "was a", "a good", "good movie".
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of \K, which drops everything matched so far from the reported match, so only the part after \K is returned. So essentially, we match a word, match separators, match another word, forget everything, then match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
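To see what the two-step split produces, here is a rough Python sketch. Python's built-in re module does not support \K, so instead of splitting on the separators between chunks, this version collects the two-word chunks directly and then splits each chunk on \W+ as described. The function name two_word_chunks is made up for illustration, and this is not RapidMiner's implementation.

```python
import re

def two_word_chunks(text):
    # Step 1: take two words at a time; a trailing single word is kept as is.
    # (Stand-in for splitting with \K, which Python's re does not offer.)
    return re.findall(r"\w+\W+\w+|\w+", text)

chunks = two_word_chunks("That was a good movie.")
print(chunks)  # ['That was', 'a good', 'movie']

# Step 2: split each chunk on the separators to get its words.
print([re.split(r"\W+", chunk) for chunk in chunks])
# [['That', 'was'], ['a', 'good'], ['movie']]
```

Note that these chunks do not overlap, so this yields "That was" and "a good" but not "was a".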
Also note that the pattern \w selects only Latin characters, so if your documents contain text in a different character set, these characters will be consumed by the \W, which is supposed to match the separators. If you want to capture text with non-Latin character sets, like Greek for example, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text in one language and not in another, you can modify it accordingly by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you will use "{Greek}", or "{InGreek}", which is what RapidMiner supports.
What you can do is use a zero width group (like a positive look-ahead, as shown in example). Regex usually "consumes" characters it checks, but with a positive lookahead/lookbehind, you assert that characters exist without preventing further checks from checking those letters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
This pattern matches each pair of adjacent words (note that the last word gets no match of its own, since it has no following word to pair with). The first word is in the first capture group, (\w+). Then a positive lookahead checks for a sequence of non-word characters, \W+, followed by another string of word characters, \w+. Because that part sits inside the lookahead (?=...), the second word is not "consumed" and can start the next match.
Here is a link to a demo on Regex101
Note that for each match, the first word is in group 1, and the separators plus the second word are in group 2.
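A small Python sketch of the lookahead idea, for trying it outside Regex101. It uses a slight variant of the pattern in which the second capture group wraps only the second word, so the separators stay out of group 2. The name word_pairs is an illustrative choice.

```python
import re

# Group 1 is consumed; the lookahead only peeks at the separators and
# the next word, so that word is still available for the next match.
PAIR = re.compile(r"(\w+)(?=\W+(\w+))")

def word_pairs(text):
    """Return overlapping (word, next_word) pairs (hypothetical helper)."""
    return [(m.group(1), m.group(2)) for m in PAIR.finditer(text)]

print(word_pairs("That was a good movie."))
# [('That', 'was'), ('was', 'a'), ('a', 'good'), ('good', 'movie')]
```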
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)) inspired from this SO question.
My question sounds wrong once you understand that this is a problem of overlapping regex matches.

Regex end of match with multiple OR conditions

I want to search for all matches of a product reference with RegEx. The product references always start with "XXV-" followed by a number and, in some cases, one or two characters. Some examples: "XXV-1234-AB", "XXV-1232", "XXV-12-X". The only anchor for the match is the start.
In this sample:
<div>Reference to search is XXV-1234-BH</div>
<div>Other reference to search is XXV-1235-VC
</div>
<div>And XXV-1236-HG
also</div>
I need to match all three codes, preferably with a single regular expression.
I tried:
(?=XXV-).*?(?=<) which only matches the first.
(?=XXV-).*?(?=\n) which matches all, but includes the </div> of the first.
Is it possible to match "<" or "\n"?
Thanks for your time!
Why is your question about 'matching "<" or "\n"'? You know exactly what the format should be, so just build a regex for it.
The product references always start with "XXV-" followed by a number and, in some cases, one or two characters.
This can be matched by:
XXV-\d+(-[A-Z]{1,2})?
[A-Z] means "any character between A and Z". If you want to include other characters, e.g. lower-case, you could use: [a-zA-Z].
{1,2} means 1-2 of the previous pattern (in this case, [A-Z]).
? makes the capture group optional.
\d means any digit, i.e. it is equivalent to [0-9].
Here is a demo, with some test cases
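For a quick check against the sample from the question, here is a minimal Python sketch. The only change to the pattern is that the optional group is made non-capturing, so that findall returns whole references rather than just the suffix.

```python
import re

# (?:...) keeps the optional suffix out of the capture results.
REFERENCE = re.compile(r"XXV-\d+(?:-[A-Z]{1,2})?")

SAMPLE = """<div>Reference to search is XXV-1234-BH</div>
<div>Other reference to search is XXV-1235-VC
</div>
<div>And XXV-1236-HG
also</div>"""

print(REFERENCE.findall(SAMPLE))
# ['XXV-1234-BH', 'XXV-1235-VC', 'XXV-1236-HG']
```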

Capturing uppercase words in text with regex

I'm trying to find words that are in uppercase in a given piece of text. The words must be one after the other to be considered and they must be at least 4 of them.
I have "almost" working code, but it captures much more: [A-Z]*(?: +[A-Z]*){4,}. The capture group also includes spaces at the start or the end of those words (like a boundary).
I have a playground if you want to test it out: https://regex101.com/r/BmXHFP/2
Is there a way to make the regex in example capture only the words in the first sentence? The language I'm using is Go and it has no look-behind/ahead.
In your regex, you just need to change the second * for a +:
[A-Z]*(?: +[A-Z]+){4,}
Explanation
With (?: +[A-Z]*), you are matching "one or more spaces followed by zero or more letters", so a run of spaces alone can match. After replacing the * with a +, spaces only match when uppercase letters follow them.
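The difference can be sketched in Python (the question uses Go, but this particular pattern uses no features that differ between the two, as far as I know). The sample strings are invented for illustration and are not the text from the linked playground.

```python
import re

ORIGINAL = re.compile(r"[A-Z]*(?: +[A-Z]*){4,}")  # second quantifier is *
FIXED = re.compile(r"[A-Z]*(?: +[A-Z]+){4,}")     # second quantifier is +

shouting = "THIS IS A LOUD SENTENCE HERE and this is not."  # made-up sample
gaps = "one     two"  # five spaces and no uppercase letters at all

print(FIXED.findall(shouting))            # ['THIS IS A LOUD SENTENCE HERE']
print(ORIGINAL.search(gaps) is not None)  # True: the spaces alone match
print(FIXED.search(gaps))                 # None
```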
Demo on regex101
Replace the *s by +s, and your regex only matches the words in the first sentence.
A * quantifier also matches the empty string. Looking at your regex and ignoring both [A-Z]*, all that remains is a sequence of spaces. Using + makes sure that there is at least one uppercase character after each run of spaces.
You need to require at least one uppercase letter, as in [A-Z]*(?: +[A-Z]+){4,} (see updated regex).
A better regex also allows words with no spaces between them: [A-Z]*(?: *[A-Z]+){4,} (see better regex).
The * after the space allows uppercase letters to match even without spaces between them.

Regex to convert words in TitleCase

I use this regex to convert words in TitleCase and confirm each substitution:
:s/\%V\<\([A-Za-z0-9àäâæèéëêìòöôœùüûçÀÄÂÆßÈÉËÊÌÖÔŒÙÜÛ]\)\([A-Za-z0-9àäâæèéëêìòöôœùüûçÀÄÂÆßÈÉËÊÌÖÔŒÙÜÛ]*\)\>/\u\1\L\2/gc
However, this also matches words that are already in TitleCase.
Does anyone know how to change the above regex so that it skips words that are already in TitleCase?
:s/\%V\<\([a-z0-9àäâæèéëêìòöôœùüûç]\)\([A-Za-z0-9àäâæèéëêìòöôœùüûçÀÄÂÆßÈÉËÊÌÖÔŒÙÜÛ]*\)\>/\u\1\L\2/gc
seems to do the trick, here.
Because you have explicitly included uppercase characters in the range you use in the first letter capture group, your pattern is going to match both foo and Foo. Removing the uppercase characters from that range seems to resolve your immediate problem.
To match only non-titlecase words, you want to match those that start either (a) with a lowercase letter or (b) with two uppercase letters. The following will do it (add accented letters and digits to taste):
\b([A-Z])([A-Z][A-Za-z]*)|\b([a-z])([a-zA-Z]+)
But some words match at groups \1 and \2, others at \3 and \4. I don't use vim so I can't say if it'll let you substitute with this kind of pattern. (E.g., \u\1\3\L\2\4; only two of the four will ever be non-empty)
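Outside of vim, one way to handle the "two of the four groups" issue is a replacement function. The following is only a sketch in Python, not a vim solution; the names NOT_TITLE, _fix and title_case are made up for illustration, and accented letters and digits are left out as in the pattern above.

```python
import re

NOT_TITLE = re.compile(r"\b([A-Z])([A-Z][A-Za-z]*)|\b([a-z])([a-zA-Z]+)")

def _fix(match):
    # Only one alternative matched, so pick whichever pair of groups is set.
    first = match.group(1) or match.group(3)
    rest = match.group(2) or match.group(4)
    return first.upper() + rest.lower()

def title_case(text):
    """Title-case only the words that are not already in TitleCase."""
    return NOT_TITLE.sub(_fix, text)

print(title_case("foo Bar BAZ qUX"))  # Foo Bar Baz Qux
```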

match the same unknown character multiple times

I have a regex problem I can't seem to solve. I actually don't know if regex can do this, but I need to match a range of characters n times at the end of a pattern.
eg. blahblah[A-Z]{n}
The problem is that the characters matching the ending range must all be the same.
For example, I want to match
blahblahAAAAA
blahblahEEEEE
blahblahQQQQQ
but not
blahblahADFES
blahblahZYYYY
Is there some regex pattern that can do this?
You can use this pattern: blahblah([A-Z])\1+
The \1 is a back-reference to the first capture group, in this case ([A-Z]). And the + will match that character one or more times. To limit it you can replace the + with a specific number of repetitions using {n}, such as \1{3} which will match it three times.
If you need the entire string to match then be sure to prefix with ^ and end with $, respectively, so that the pattern becomes ^blahblah([A-Z])\1+$
You can read more about back-references here.
In most regex implementations, you can accomplish this by referencing a capture group in your regex. For your example, you can use the following to match the same uppercase character five times:
blahblah([A-Z])\1{4}
Note that to match the character n times, you need to use \1{n-1}, since one occurrence is already matched by the capture group.
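A short Python sketch of the n-1 rule, checked against the examples from the question. The helper same_char_suffix and its prefix parameter are hypothetical names introduced here.

```python
import re

def same_char_suffix(n, prefix="blahblah"):
    """Build a pattern: prefix, then one capital letter repeated n times in total."""
    # The capture group matches the letter once; \1{n-1} supplies the rest.
    return re.compile(re.escape(prefix) + rf"([A-Z])\1{{{n - 1}}}")

pattern = same_char_suffix(5)  # equivalent to blahblah([A-Z])\1{4}

for s in ["blahblahAAAAA", "blahblahQQQQQ", "blahblahADFES", "blahblahZYYYY"]:
    print(s, bool(pattern.fullmatch(s)))
# blahblahAAAAA True
# blahblahQQQQQ True
# blahblahADFES False
# blahblahZYYYY False
```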
blahblah(.)\1*\b should work in nearly all language flavors. (.) captures one of anything, then \1* matches that (the first match) any number of times.
blahblah([A-Z]|[a-z])\1+
This should help.