Is there a way to use periodicity in a regular expression? - regex

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, "That was a good movie." should result in "That was", "was a", "a good", "good movie".
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?

First of all, we can use \W to identify the characters that separate the words. To treat several consecutive separators as a single one, we use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach relies on \K, which drops everything matched so far from the reported match, so that only what follows \K is returned. In effect: match a word, match separators, match another word, forget everything so far, then match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
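As a rough illustration outside RapidMiner: Python's built-in re module does not support \K, so the sketch below emulates the first split by capturing the trailing separators in a group and cutting the text at that group's span. The helper name split_every_two_words is made up for this example and is not part of RapidMiner or of re.

```python
import re

# "word, separators, word" followed by the separator run (group 1) that
# \K would have isolated in an engine that supports it.
PAIR_THEN_SEPARATOR = re.compile(r"\w+\W+\w+(\W+)")

def split_every_two_words(text):
    """Split text at every second run of separators (hypothetical helper)."""
    tokens = []
    start = 0
    for match in PAIR_THEN_SEPARATOR.finditer(text):
        # Keep everything up to the separator run, then skip the run itself.
        tokens.append(text[start:match.start(1)])
        start = match.end(1)
    if start < len(text):
        tokens.append(text[start:])  # leftover text after the last full pair
    return tokens

two_word_tokens = split_every_two_words("That was a good movie.")

# Second pass: split each token on the separators alone, dropping empties.
words_per_token = [
    [word for word in re.split(r"\W+", token) if word]
    for token in two_word_tokens
]
```

Like the \K pattern, this produces non-overlapping pairs: the last token here is the leftover "movie." rather than "good movie".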
Also note that \w matches only Latin letters (plus digits and the underscore), so if your documents contain text in a different script, those characters will be consumed by \W, which is supposed to match the separators. To handle non-Latin scripts, such as Greek, change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text in one language but not in another, you can modify it accordingly by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, use "{Greek}", or "{InGreek}", which is what RapidMiner supports.

What you can do is use a zero width group (like a positive look-ahead, as shown in example). Regex usually "consumes" characters it checks, but with a positive lookahead/lookbehind, you assert that characters exist without preventing further checks from checking those letters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
This pattern matches once for each pair of adjacent words (note that it won't match at the last word, since that word has no partner). The first word is in the first capture group, (\w+). Then a positive lookahead matches a sequence of non-word characters, \W+, followed by another run of word characters, \w+. Because this part sits inside the lookahead (?=...), the second word is not "consumed" and can start the next match.
Here is a link to a demo on Regex101
Note that for each match, each word is in its own capture group (group 1 and group 2); group 2 also contains the separators that precede the second word.
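To see the overlapping behaviour outside Regex101, here is a small sketch in Python (chosen only for illustration; the variable names are my own). It joins group 1 with group 2 to rebuild each pair.

```python
import re

# Group 1 is the consumed word; group 2 (inside the lookahead) is the
# separator run plus the next word, which stays available for the next match.
PAIR = re.compile(r"(\w+)(?=(\W+\w+))")

text = "That was a good movie."
pairs = [first + rest for first, rest in PAIR.findall(text)]
print(pairs)  # ['That was', 'was a', 'a good', 'good movie']
```

Because only the first word is consumed, each following word gets its turn as the start of a pair, which is exactly the overlap asked for.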

Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)), inspired by this SO question.
My question turns out to be badly phrased once you realize that this is a problem of overlapping regex matches.

Related

Regex needed to match individual values from comma separated list

I need a regex to match and extract all the values from a comma separated list.
The maximum size of the list is always the same.
For example, if the max size is 3, the following lists can exist:
VALUE1,VALUE2,VALUE3
VALUE1,VALUE2
VALUE1
I need, if possible, a regex that extracts the elements above into capturing groups, no matter which list is given as input.
I have tried with something simple like:
(.*)(,?)(.*)(,?)(.*)
But it matches the whole thing, no values are extracted. I don't understand why the ? doesn't work correctly in this case.
What I need: to apply the same regex for all the lists and extract the values.
Given the regex is used (.*)(,?)(.*)(,?)(.*)
Given the input list is VALUE1,VALUE2,VALUE3
Then I expect that group1=VALUE1, group3=VALUE2, group5=VALUE3
Given the regex is used (.*)(,?)(.*)(,?)(.*)
Given the input list is VALUE1,VALUE2
Then I expect that group1=VALUE1, group3=VALUE2
Given the regex is used (.*)(,?)(.*)(,?)(.*)
Given the input list is VALUE1
Then I expect that group1=VALUE1
You can make some of the groups optional, and simplify slightly by avoiding parentheses where they are unnecessary. You should also make your regex unambiguous; .* can match a comma, and the regex engine will do that if it needs to do that in order to find a match. You will also want to add anchors to the expression to avoid matching a substring of a longer line.
^([^,]*)(,([^,]*)(,([^,]*))?)?$
Demo: https://regex101.com/r/swUn3B/2
(where I had to add \n to the character class [^,\n] to avoid straddling newlines in the test data).
The fundamental problem with your attempt is that ,? is allowed to match nothing, and so the regex engine will do that if it's needed to achieve a match. The trick in this solution is to only make the entire group optional: if there is no comma, that's fine; but if there is a comma, it needs to be followed by another group of non-comma characters. We repeat this as many times as necessary to capture the specified maximum number of non-comma groups.
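A minimal sketch of how the nested optional groups behave, written in Python purely for illustration (the question does not say which language is used, and extract_values is a made-up helper name). The values sit in groups 1, 3 and 5; groups that did not take part in the match come back as None.

```python
import re

# Same pattern as above: each ",value" part is optional as a whole.
LIST_RE = re.compile(r"^([^,]*)(,([^,]*)(,([^,]*))?)?$")

def extract_values(line):
    """Return the values of a list with at most three items, or None."""
    match = LIST_RE.match(line)
    if match is None:
        return None  # more than three items, so the anchored pattern fails
    # Groups 2 and 4 include the comma; 1, 3 and 5 hold the bare values.
    return [value for value in match.group(1, 3, 5) if value is not None]

print(extract_values("VALUE1,VALUE2,VALUE3"))  # ['VALUE1', 'VALUE2', 'VALUE3']
print(extract_values("VALUE1,VALUE2"))         # ['VALUE1', 'VALUE2']
print(extract_values("VALUE1"))                # ['VALUE1']
```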

how to make a regex to validate a username

I've written this regex
/(?=.*[a-z])(?!.*[A-Z])([\w\_\-\.].{3,10})/g
to check the following conditions
>has minimum of 3 and maximum of 10 characters.
>must contain at least one lowercase letter.
>must contain only lowercase letters, '_', '-', '.' and digits.
This works, but it returns true even if there are more than 10 characters.
I would like a new or modified regular expression to check the above given conditions.
add anchors
remove the last dot
the negative lookahead is useless if you use a correct character class
This regex will work:
^(?=.*[a-z])[a-z0-9_.-]{3,10}$
Demo & explanation
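For a quick check of the three rules, here is a sketch in Python (the question's /.../g syntax suggests JavaScript, so the language and the helper name is_valid_username are assumptions made for illustration only).

```python
import re

# Anchored pattern from above: the lookahead demands one lowercase letter,
# the character class plus {3,10} enforces allowed characters and length.
USERNAME = re.compile(r"^(?=.*[a-z])[a-z0-9_.-]{3,10}$")

def is_valid_username(name):
    """Return True if name satisfies the three stated conditions."""
    return USERNAME.match(name) is not None

print(is_valid_username("user_test"))     # True
print(is_valid_username("username1000"))  # False: 12 characters
print(is_valid_username("1234"))          # False: no lowercase letter
```

Without the anchors, a pattern like .{3,10} can match a 10-character slice of a longer string, which is why the original accepted over-long names.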
You can use this REGEX
REGEX Demo
([a-z]{1}[0-9a-z_.-]{2,9})
Tested text:
username77
usr
username10
user_test
usr.1000
There are many ways of doing this. I believe the common characteristic is they will all have a positive lookahead. Here is another.
^(?=.{3,10}$)[a-z\d_.-]*[a-z][a-z\d_.-]*$
Demo
Notice that [a-z\d_.-]* appears twice. Some regex engines support subroutines (or subexpressions) that allow one to save a repeated part of the regex to a numbered or named capture group for reuse later in the pattern. When using the PCRE engine, for example, you could write
^(?=.{3,10}$)([a-z\d_.-]*)[a-z](?1)$
Demo
(?1) re-runs the subpattern of capture group 1, ([a-z\d_.-]*), at that point, in contrast to \1, which matches the text that capture group 1 captured. Using subroutines can shorten the regex, but more importantly it reduces the chance of errors when the repeated tokens are changed.
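Engines without subroutine calls can get a similar benefit by composing the pattern from a shared fragment. The sketch below does this in Python, whose built-in re module has no (?1); the names ALLOWED_RUN and USERNAME are my own, and re.ASCII is an added assumption so that \d only matches 0-9.

```python
import re

# The repeated part is defined once and interpolated twice.
ALLOWED_RUN = r"[a-z\d_.-]*"

# Doubled braces keep {3,10} literal inside the f-string.
USERNAME = re.compile(
    rf"^(?=.{{3,10}}$){ALLOWED_RUN}[a-z]{ALLOWED_RUN}$",
    re.ASCII,  # assumption: restrict \d to ASCII digits
)

print(USERNAME.pattern)  # ^(?=.{3,10}$)[a-z\d_.-]*[a-z][a-z\d_.-]*$
print(bool(USERNAME.match("usr.1000")))  # True
print(bool(USERNAME.match("1234")))      # False
```

This is string composition rather than a true subroutine, but it gives the same single point of change for the repeated tokens.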

Cleaning up a regular expression which has lots of repetition

I am looking to clean up a regular expression which matches 2 or more characters at a time in a sequence. I have made one which works, but I was looking for something shorter, if possible.
Currently, it looks like this for every character that I want to search for:
([A]{2,}|[B]{2,}|[C]{2,}|[D]{2,}|[E]{2,}|...)*
Example input:
AABBBBBBCCCCAAAAAADD
See this question, which I think was asking the same thing you are asking. You want to write a regex that will match 2 or more of the same character. Let's say the characters you are looking for are just capital letters, [A-Z]. You can do this by matching one character in that set and grouping it by putting it in parentheses, then matching that group using the reference \1 and saying you want two or more of that "group" (which is really just the one character that it matched).
([A-Z])\1{1,}
The reason it's {1,} and not {2,} is that the first character was already matched by the set [A-Z].
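A short sketch of the backreference in action, in Python for illustration (the helper name repeated_runs and the first sample string are my own):

```python
import re

# One capital letter, then the same letter at least once more.
REPEATED_RUN = re.compile(r"([A-Z])\1{1,}")

def repeated_runs(text):
    """Return every run of two or more identical capital letters."""
    return [match.group(0) for match in REPEATED_RUN.finditer(text)]

print(repeated_runs("AABBBCD"))  # ['AA', 'BBB']

# The sample input from the question consists only of such runs,
# so joining the runs gives the input back.
sample = "AABBBBBBCCCCAAAAAADD"
print("".join(repeated_runs(sample)) == sample)  # True
```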
Not sure I understand your needs but, how about:
[A-E]{2,}
This is shorter than yours, but note that it also accepts mixed runs such as AB.
But if you want multiple occurrences of each letter:
(?:([A-Z])\1+)+
where ([A-Z]) matches one capital letter and stores it in group 1
\1 is a backreference that matches the same text as group 1
+ requires one or more repetitions
Finally it matches strings like the one you've given: AABBBBBBCCCCAAAAAADD
To be sure there're no other characters in the string, you have to anchor the regex:
^(?:([A-Z])\1+)+$
And, if you want to match case-insensitively:
^(?i)(?:([A-Z])\1+)+$
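As a hedged sketch of the anchored check in Python (an assumption, since the question names no language): recent Python versions reject an inline (?i) that is not at the very start of the pattern, so the flag is passed as re.IGNORECASE instead. The constant names are made up for this example.

```python
import re

# Whole string must consist of runs of two or more identical letters.
ONLY_RUNS = re.compile(r"^(?:([A-Z])\1+)+$")
ONLY_RUNS_ANY_CASE = re.compile(r"^(?:([A-Z])\1+)+$", re.IGNORECASE)

print(bool(ONLY_RUNS.match("AABBBBBBCCCCAAAAAADD")))  # True
print(bool(ONLY_RUNS.match("AAB")))                   # False: lone B
print(bool(ONLY_RUNS.match("aaBB")))                  # False: lowercase
print(bool(ONLY_RUNS_ANY_CASE.match("aaBB")))         # True
```

With the flag set, the backreference is compared case-insensitively as well, so a run like "aA" also counts as a repetition.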

optimizing regex to find key=value pairs, space delimited

Shortened URL with my current regex in regexpal:
http://bit.ly/1jbOFGd
I have a line of key=value pairs, space delimited. Some values contain spaces and punctuation so I do a positive lookahead to check for the existence of another key.
I want to tokenize the key and value, which I later convert to a dict in python.
My guess is that I can speed this up by getting rid of .*? but how? In python I convert 10,000 of these lines in 4.3 seconds. I'd like to double or triple that speed by making this regex match more efficient.
Update:
(?<=\s|\A)([^\s=]+)=(.*?)(?=(?:\s[^\s=]+=|$))
I would think this one is more efficient than yours (it still uses .*? for the value, but its lookahead is nowhere near as complex and contains no lazy quantifier), but I'll need you to test. It does the same as my original expression but handles values differently: a lazy .*? is followed by a lookahead for either whitespace, a key and an = sign, or the end of the string. Notice that I always define a key as [^\s=]+, so keys cannot contain an equals sign or whitespace (being this specific helps us avoid lazy matches).
Source
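Since the goal is a Python dict, here is a minimal sketch of the updated expression in Python. One adjustment is an assumption on my part: Python's re module requires fixed-width lookbehinds, so (?<=\s|\A) is written as (?:(?<=\s)|\A). The sample line and the helper name parse_pairs are invented for illustration, and no timing claims are made.

```python
import re

# Key: no whitespace or "=". Value: lazily up to the next " key=" or the end.
PAIR = re.compile(r"(?:(?<=\s)|\A)([^\s=]+)=(.*?)(?=\s[^\s=]+=|$)")

def parse_pairs(line):
    """Turn one line of space-delimited key=value pairs into a dict."""
    # findall returns (key, value) tuples because the pattern has two groups.
    return dict(PAIR.findall(line))

record = parse_pairs("name=John Smith city=New York, NY zip=10001")
print(record)
# {'name': 'John Smith', 'city': 'New York, NY', 'zip': '10001'}
```

Compiling the pattern once and reusing it across all lines avoids repeated pattern lookups, though whether that matters for your timing would need measuring.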
Original:
Are there some rules I am missing that you need by doing something this simple?
(?<=\s|\A)([^=]+)=([\S]+)
This starts with a lookbehind of either a space character (\s) or the beginning of the string (\A). Then we match everything except =, followed by a =, and match everything except whitespace (\s).
"Lookbehind" (related to "lookahead" and "lookaround") is the key regular expression concept to read up on here: it lets you match and skip individual components of the string.
Good examples here: http://www.rexegg.com/regex-lookarounds.html.

Matching parts of string that contain no consecutive dashes

I need a regex that will match strings of letters that do not contain two consecutive dashes.
I came close with this regex that uses lookaround (I see no alternative):
([-a-z](?<!--))+
Which given the following as input:
qsdsdqf--sqdfqsdfazer--azerzaer-azerzear
Produces three matches:
qsdsdqf-
sqdfqsdfazer-
azerzaer-azerzear
What I want however is:
qsdsdqf-
-sqdfqsdfazer-
-azerzaer-azerzear
So my regex loses the first dash, which I don't want.
Who can give me a hint or a regex that can do this?
This should work:
-?([^-]-?)*
It makes sure that there is at least one non-dash character between every two dashes.
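A quick way to try this on the input from the question, sketched in Python for illustration (the question does not name a language). The pattern can also match the empty string, so empty matches are filtered out.

```python
import re

# Optional leading dash, then non-dash characters each optionally
# followed by a single dash, so two dashes never sit side by side.
NO_DOUBLE_DASH = re.compile(r"-?([^-]-?)*")

text = "qsdsdqf--sqdfqsdfazer--azerzaer-azerzear"
parts = [
    match.group(0)
    for match in NO_DOUBLE_DASH.finditer(text)
    if match.group(0)  # skip the empty match at the end of the string
]
print(parts)  # ['qsdsdqf-', '-sqdfqsdfazer-', '-azerzaer-azerzear']
```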
Looks to me like you do want to match strings that contain double hyphens, but you want to break them into substrings that don't. Have you considered splitting it between pairs of hyphens? In other words, split on:
(?<=-)(?=-)
As for your regex, I think this is what you were getting at:
(?:[^-]+|-(?<!--)|\G-)+
The -(?<!--) will match one hyphen, but if the next character is also a hyphen the match ends. Next time around, \G- picks up the second hyphen because it's the next character; the only way that can happen (except at the beginning of the string) is if a previous match broke off at that point.
Be aware that this regex is more flavor dependent than most; I tested it in Java, but not all flavors support \G and lookbehinds.
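The splitting suggestion above is easy to try in flavors that allow splitting on an empty match. A sketch in Python, assuming version 3.7 or later (where re.split accepts empty matches); Python's re has no \G, so only the split variant is shown.

```python
import re

text = "qsdsdqf--sqdfqsdfazer--azerzaer-azerzear"

# Zero-width split point: a dash behind and a dash ahead.
parts = re.split(r"(?<=-)(?=-)", text)
print(parts)  # ['qsdsdqf-', '-sqdfqsdfazer-', '-azerzaer-azerzear']
```

Nothing is consumed by the split, so each dash of a pair stays attached to the piece on its own side.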