REGEX - nesting quantifiers in combined statements - regex

A last name in Hebrew can be either in an English format, which is just a regular combination of letters, like "Smith", "Camp", "Jack" etc, or a combination of two words with a space in the middle, like "Ben David", "Bar Yohay", "Yom Tov". i tried to create a regexp that allows either the first format - a last name that is at least two letters long, or the second one - a last name that is composed of two words, each two or more letters long, with a space in the middle. here is what i came up with:
(^[a-z]{2,}$)|((^[a-z]{2,}$)(^[ ]$)(^[a-z]{2,}$))
(I know it does not allow capital letters)
For some reason it does allow names of the first format like Smith and Jerry, but does not allow names of the second one. is there a problem with the formatting of the space in the middle? This should be an easy one for regexp professionals. thanks in advance :)

You can simplify your regex to
^[a-z]{2,}(?: [a-z]{2,})?$

You are misusing anchors (^ and $). These match the beginning and ending of the string, respectively. What you actually want is:
(^[a-z]{2,}$)|(^([a-z]{2,})([ ])([a-z]{2,})$)
Further, you can simplify your expression to:
^[a-z]{2,}$|^[a-z]{2,} [a-z]{2,}$
unless you specifically need to capture groups.
Or (so you only need one pair of anchors):
^(?:[a-z]{2,}|[a-z]{2,} [a-z]{2,})$
(?:...) is a non-capturing group, necessary to restrict the scope of the alternation.

Related

Is there a way to use periodicity in a regular expression?

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, That was a good movie. should result to That was, was a, a good, good movie.
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of the \K feature that removes from the match everything that has been captured from the regex up to that point, and starts a new match that will be returned. So essentially, we do: match a word, match separators, match another word, forget everything, match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
Also note that, the pattern \w selects only Latin characters, so if your documents contain text in a different character set, these characters will be consumed by the \W which is supposed to match the separators. If you want to capture text with non-Latin character sets, like Greek for example, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text on one language and not on another language, you can modify it accordingly, by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you will use "{Greek}", or "{InGreek}" which is what RapidMiner supports.
What you can do is use a zero width group (like a positive look-ahead, as shown in example). Regex usually "consumes" characters it checks, but with a positive lookahead/lookbehind, you assert that characters exist without preventing further checks from checking those letters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
The following pattern matches for each pair of two words (note that it won't match the last word since it does not have a pair). The first word is in the first capture group, (\w+). Then a positive lookahead includes a match for a sequence of non word characters \W+ and then another string of word characters \w+. The lookahead (?=...) the second word is not "consumed".
Here is a link to a demo on Regex101
Note that for each match, each word is in its own capture group (group 1, group 2)
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)) inspired from this SO question.
My question sounds wrong once you understand that is a problem of an overlapping regex pattern.

What would be the Regex expression to get the first letter after a group of character and some integers?

I have a string that the following structure:
ABCD123456EFGHIJ78 but sometimes it's missing a number or a character like:
ABC123456EFGHIJ78 or
ABCD123456E or
ABCD12345EFGHIJ78
etc.
That's why I need regular expressions.
What I want to extract is the first letter of the third group, in this case 'E'.
I have the following regex:
(\D+)+(\d+)+(\D{1})\3
but I don't get the letter E.
This seems to work for the example cases you provided.
^(?:[A-Za-z]+)(?:\d+)(.)
It assumes that the first group is only letters and that the second group is only digits.
There's already a nice answer.
But for the records, your initial proposal was very close to work. You just needed to say that the character matching the 3rd group can repeat several times by adding a star:
^(\D+)(\d+)(\D{1})\3*
The main weakness is that \D matches any char except digits, so also spaces. Making it more robust leads us to explicit the range of chars accepted:
^([A-Za-z]+)(\d+)([A-Za-z]{1})\3*
It's much better, but my favourite uses \w to match at the end of the pattern any non white character:
([A-Za-z]+)(\d+)([A-Za-z]{1})\w*

Cleaning up a regular expression which has lots of repetition

I am looking to clean up a regular expression which matches 2 or more characters at a time in a sequence. I have made one which works, but I was looking for something shorter, if possible.
Currently, it looks like this for every character that I want to search for:
([A]{2,}|[B]{2,}|[C]{2,}|[D]{2,}|[E]{2,}|...)*
Example input:
AABBBBBBCCCCAAAAAADD
See this question, which I think was asking the same thing you are asking. You want to write a regex that will match 2 or more of the same character. Let's say the characters you are looking for are just capital letters, [A-Z]. You can do this by matching one character in that set and grouping it by putting it in parentheses, then matching that group using the reference \1 and saying you want two or more of that "group" (which is really just the one character that it matched).
([A-Z])\1{1,}
The reason it's {1,} and not {2,} is that the first character was already matched by the set [A-Z].
Not sure I understand your needs but, how about:
[A-E]{2,}
This is the same as yours but shorter.
But if you want multiple occurrences of each letter:
(?:([A-Z])\1+)+
where ([A-Z]) matches one capital letter and store it in group 1
\1 is a backreference that repeats group 1
+ assume that are one or more repetition
Finally it matches strings like the one you've given: AABBBBBBCCCCAAAAAADD
To be sure there're no other characters in the string, you have to anchor the regex:
^(?:([A-Z])\1+)+$
And, if you wnat to match case insensitive:
^(?i)(?:([A-Z])\1+)+$

Regex - Not a match if there is a dash

I have two kinds of strings.
index_12323 (just numbers, no dash).
index_12a-dcd-edff (basically an uuid).
What's a regex that only matches the first one, not the second one, basically, if it has dash, it is not a match.
Thanks.
Something of this sort is what you're looking for:
^index_[0-9]+$
Because the "_" (underscore) is covered in the word character class, there is no need to explicitly check for it. Use this regex to match on the first pattern:
^\w+\d+$

Matching parts of string that contain no consecutive dashes

I need a regex that will match strings of letters that do not contain two consecutive dashes.
I came close with this regex that uses lookaround (I see no alternative):
([-a-z](?<!--))+
Which given the following as input:
qsdsdqf--sqdfqsdfazer--azerzaer-azerzear
Produces three matches:
qsdsdqf-
sqdfqsdfazer-
azerzaer-azerzear
What I want however is:
qsdsdqf-
-sqdfqsdfazer-
-azerzaer-azerzear
So my regex loses the first dash, which I don't want.
Who can give me a hint or a regex that can do this?
This should work:
-?([^-]-?)*
It makes sure that there is at least one non-dash character between every two dashes.
Looks to me like you do want to match strings that contain double hyphens, but you want to break them into substrings that don't. Have you considered splitting it between pairs of hyphens? In other words, split on:
(?<=-)(?=-)
As for your regex, I think this is what you were getting at:
(?:[^-]+|-(?<!--)|\G-)+
The -(?<!--) will match one hyphen, but if the next character is also a hyphen the match ends. Next time around, \G- picks up the second hyphen because it's the next character; the only way that can happen (except at the beginning of the string) is if a previous match broke off at that point.
Be aware that this regex is more flavor dependent than most; I tested it in Java, but not all flavors support \G and lookbehinds.