Regex needed to match individual values from comma separated list - regex

I need a regex to match and extract all the values from a comma separated list.
The maximum size of the list is always the same.
For example, if the max size is 3, the following lists can exist:
VALUE1,VALUE2,VALUE3
VALUE1,VALUE2
VALUE1
I need, if possible, a regex to extract the elements above into capturing groups, no matter which list is given as input.
I have tried with something simple like:
(.*)(,?)(.*)(,?)(.*)
But it matches the whole thing, no values are extracted. I don't understand why the ? doesn't work correctly in this case.
What I need: to apply the same regex for all the lists and extract the values.
Given the regex is used (.*)(,?)(.*)(,?)(.*)
Given the input list is VALUE1,VALUE2,VALUE3
Then I expect that group1=VALUE1, group3=VALUE2, group5=VALUE3
Given the regex is used (.*)(,?)(.*)(,?)(.*)
Given the input list is VALUE1,VALUE2
Then I expect that group1=VALUE1, group3=VALUE2
Given the regex is used (.*)(,?)(.*)(,?)(.*)
Given the input list is VALUE1
Then I expect that group1=VALUE1

You can make some of the groups optional, and simplify slightly by avoiding parentheses where they are unnecessary. You should also make your regex unambiguous; .* can match a comma, and the regex engine will do that if it needs to do that in order to find a match. You will also want to add anchors to the expression to avoid matching a substring of a longer line.
^([^,]*)(,([^,]*)(,([^,]*))?)?$
Demo: https://regex101.com/r/swUn3B/2
(where I had to add \n to the character class [^,\n] to avoid straddling newlines in the test data).
The fundamental problem with your attempt is that ,? is allowed to match nothing, and so the regex engine will do that if it's needed to achieve a match. The trick in this solution is to only make the entire group optional: if there is no comma, that's fine; but if there is a comma, it needs to be followed by another group of non-comma characters. We repeat this as many times as necessary to capture the specified maximum number of non-comma groups.
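As a rough illustration of the point above, here is a minimal Python sketch (the helper name extract_values is made up for this example, and Python's re module is only one possible engine). It shows the greedy attempt swallowing the whole line, and the nested optional groups picking out up to three values from groups 1, 3 and 5.

```python
import re

# The attempt from the question: the first greedy .* takes the whole line.
ATTEMPT_RE = re.compile(r"(.*)(,?)(.*)(,?)(.*)")

# The answer's pattern: each ",value" pair is optional as a unit.
LIST_RE = re.compile(r"^([^,]*)(,([^,]*)(,([^,]*))?)?$")


def extract_values(line):
    """Return the values found in groups 1, 3 and 5, or None if there is no match."""
    m = LIST_RE.match(line)
    if m is None:
        return None
    # Groups 2 and 4 are the optional wrappers; unused value groups are None.
    return [g for g in (m.group(1), m.group(3), m.group(5)) if g is not None]


print(ATTEMPT_RE.match("VALUE1,VALUE2,VALUE3").group(1))  # VALUE1,VALUE2,VALUE3
print(extract_values("VALUE1,VALUE2,VALUE3"))  # ['VALUE1', 'VALUE2', 'VALUE3']
print(extract_values("VALUE1,VALUE2"))         # ['VALUE1', 'VALUE2']
print(extract_values("VALUE1"))                # ['VALUE1']
```

A list longer than the maximum, such as A,B,C,D, does not match at all because of the anchors.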

Related

Regex check for name Initials

I am trying to create a regex that checks if one or more middle-name initials have the following structure:
INITIAL.[BLANK]INITIAL.[BLANK]INITIAL.
There can be multiple Initials as long as they are followed by a dot (.) - blank spaces are only allowed between two initials (e.g. L. B.)
It should not be possible to have a space after an initial if there's no other initial following.
At the moment, I have the following Regex which doesn't work perfectly as of now:
([A-Z]\. (?=[A-Z]|$))+
Using regex101, this is an example:
As you can see, it still matches the string even though there's a blank space at the end, without having another Initial following.
I am not sure why this is happening. I am just learning regex and would be glad if anyone could provide me with a solution to my problem :)
The error you're seeing occurs because at the last step, your expression reads in [A-Z]\. plus the trailing space, then looks ahead for $ (and finds it). I would express the pattern this way: (?:[A-Z]\. )*[A-Z]\.$. Treat the last initial specially because it does not have a final space.
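A small Python sketch of that difference, assuming the whole input is supposed to be the initials and nothing else (the name is_valid_initials is hypothetical):

```python
import re

# Pattern from the question: every initial must be followed by a space.
ORIGINAL = re.compile(r"([A-Z]\. (?=[A-Z]|$))+")

# Pattern from the answer: zero or more "X. " units, then a final "X." without a space.
FIXED = re.compile(r"(?:[A-Z]\. )*[A-Z]\.$")


def is_valid_initials(text):
    """True if the entire string is a run of initials with no trailing space."""
    return FIXED.fullmatch(text) is not None


print(ORIGINAL.fullmatch("L. B. ") is not None)  # True: the trailing space is accepted
print(is_valid_initials("L. B. "))               # False
print(is_valid_initials("L. B."))                # True
print(is_valid_initials("L."))                   # True
```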
The pattern you tried, ([A-Z]\. (?=[A-Z]|$))+, uses a repeated capturing group, which will give you only the value of the last iteration.
Each repetition ends by matching a literal space (the one after \. inside the group), so that space must be present in the match.
Instead, you could repeat 0+ times a char [A-Z] followed by a dot and a space to match multiple occurrences.
Then match a char [A-Z] followed by a dot, asserting that what is on the right is not a non-whitespace char.
\b(?:[A-Z]\. )*[A-Z]\.(?!\S)
Regex demo
If there can be multiple spaces but it should not match a newline:
\b(?:[A-Z]\.[^\S\r\n]*)*[A-Z]\.(?!\S)
Regex demo
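To compare the two variants, here is a short Python sketch; the sample strings and the helper find_initials are invented for illustration, and other regex flavours may treat \b or \S slightly differently.

```python
import re

# Exactly one space between initials.
SINGLE_SPACE = re.compile(r"\b(?:[A-Z]\. )*[A-Z]\.(?!\S)")

# Any number of spaces or tabs (but no newline) between initials.
MULTI_SPACE = re.compile(r"\b(?:[A-Z]\.[^\S\r\n]*)*[A-Z]\.(?!\S)")


def find_initials(pattern, text):
    """Return the first run of initials found in text, or None."""
    m = pattern.search(text)
    return m.group() if m else None


print(find_initials(SINGLE_SPACE, "L. B. Smith"))   # L. B.
print(find_initials(SINGLE_SPACE, "L.   B. Smith"))  # L.
print(find_initials(MULTI_SPACE, "L.   B. Smith"))   # L.   B.
```

In both cases the match ends right after the last dot, so a trailing space is never part of the result.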

Is there a way to use periodicity in a regular expression?

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, That was a good movie. should result in That was, was a, a good, good movie.
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of the \K feature, which discards from the reported match everything matched up to that point, so that only what follows is returned. So essentially, we match a word, match separators, match another word, forget everything so far, then match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
Also note that the pattern \w selects only Latin characters (plus digits and the underscore), so if your documents contain text in a different character set, these characters will be consumed by \W, which is supposed to match the separators. If you want to capture text in non-Latin character sets, like Greek for example, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text in one language and not in another, you can modify it accordingly by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you would use "{Greek}", or "{InGreek}", which is what RapidMiner supports.
What you can do is use a zero width group (like a positive look-ahead, as shown in example). Regex usually "consumes" characters it checks, but with a positive lookahead/lookbehind, you assert that characters exist without preventing further checks from checking those letters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
This pattern matches each pair of adjacent words (note that the last word gets no match of its own, since no word follows it). The first word is in the first capture group, (\w+). Then a positive lookahead matches a sequence of non-word characters \W+ followed by another string of word characters \w+. Because this happens inside the lookahead (?=...), the second word is not "consumed" and can start the next match.
Here is a link to a demo on Regex101
Note that for each match, each word is in its own capture group (group 1 and group 2; group 2 also contains the separators that precede the second word).
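For readers who want to try the overlapping-pairs idea outside RapidMiner, here is a minimal Python sketch (word_pairs is a made-up helper; a tokenizer that splits on delimiters would use the pattern differently):

```python
import re

# Group 1 is the first word; group 2, captured inside the lookahead,
# is the separators plus the second word, which is not consumed.
PAIR_RE = re.compile(r"(\w+)(?=(\W+\w+))")


def word_pairs(text):
    """Return overlapping two-word strings, keeping the original separators."""
    return [first + rest for first, rest in PAIR_RE.findall(text)]


print(word_pairs("That was a good movie."))
# ['That was', 'was a', 'a good', 'good movie']
```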
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)) inspired from this SO question.
My question sounds wrong once you understand that it is a problem of an overlapping regex pattern.

How to do a complex multiple if-then-else regex?

I need to do a complex if-then-else with five preferential options. Suppose I first want to match abc but if it's not matched then match a.c, then if it's not matched def, then %##, then 1z;.
Can I nest the if-thens or how else would it be accomplished? I've never used if-thens before.
For instance, in the string 1z;%##defarcabcaqcdef%##1z; I would like the output abc.
In the string 1z;%##defarcabaqcdef%##1z; I would like the output arc.
In the string 1z;%##defacabacdef%##1z; I would like the output def.
In the string 1z;##deacabacdf%##1z; I would like the output %##.
In the string foo;%#dfaabaef##1z;barbbbaarr3 I would like the output 1z;.
You need to force individual matching of each option rather than putting them together. A pattern such as .*?(?:x|y|z) will match at the first occurrence of any of the options. Using that regex against a string, e.g. abczx, will find z, because that's the first option it encounters. To force prioritization you need to combine the logic of .*? and each option such that you get a regex resembling .*?x|.*?y|.*?z. It will try each option one by one until a match is found. So if x doesn't exist, it'll continue to the next option, etc.
See regex in use here
(?m)^(?:.*?(?=abc)|.*?(?=a.c)|.*?(?=def)|.*?(?=%##)|.*?(?=1z;))(.{3})
(?m) Enables multiline mode so that ^ and $ match the start/end of each line
(?:.*?(?=abc)|.*?(?=a.c)|.*?(?=def)|.*?(?=%##)|.*?(?=1z;)) Match either of the following options
.*?(?=abc) Match any character any number of times, but as few as possible, ensuring what follows is abc literally
.*?(?=a.c) Match any character any number of times, but as few as possible, ensuring what follows is a, any character, then c
.*?(?=def) Match any character any number of times, but as few as possible, ensuring what follows is def literally
.*?(?=%##) Match any character any number of times, but as few as possible, ensuring what follows is %## literally
.*?(?=1z;) Match any character any number of times, but as few as possible, ensuring what follows is 1z; literally
(.{3}) Capture any character exactly 3 times into capture group 1
If the options vary in length, you'll have to capture in different groups as seen here:
(?m)^(?:.*?(abc)|.*?(a.c)|.*?(def)|.*?(%##)|.*?(1z;))
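The separate-groups variant can be tried with a short Python sketch like the one below. The helper first_by_priority is hypothetical; it simply returns whichever capture group took part in the match, using the sample strings from the question.

```python
import re

# One alternative per option, in priority order; each has its own capture group.
PRIORITY_RE = re.compile(r"(?m)^(?:.*?(abc)|.*?(a.c)|.*?(def)|.*?(%##)|.*?(1z;))")


def first_by_priority(text):
    """Return the text matched by the highest-priority option present, or None."""
    m = PRIORITY_RE.search(text)
    if m is None:
        return None
    # Exactly one group is set: the one belonging to the alternative that matched.
    return next(g for g in m.groups() if g is not None)


print(first_by_priority("1z;%##defarcabcaqcdef%##1z;"))     # abc
print(first_by_priority("1z;%##defarcabaqcdef%##1z;"))      # arc
print(first_by_priority("1z;%##defacabacdef%##1z;"))        # def
print(first_by_priority("1z;##deacabacdf%##1z;"))           # %##
print(first_by_priority("foo;%#dfaabaef##1z;barbbbaarr3"))  # 1z;
```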

Regexp: How to match a string that doesn't have any character repeated 3 times?

I'm trying to make a single pattern that will validate an input string. The validation rule does not allow any character to be repeated 3 or more times in a row.
For example:
Aabcddee - is valid.
Aabcddde - is not valid, because of 3 d characters.
The goal is to provide a RegExp pattern that could match one of above examples, but not both. I know I could use back-references such as ([a-z])\1{1,2} but this matches only sequential characters. My problem is that I cannot figure out how to make a single pattern for that. I tried this, but I don't quite get why it isn't working:
^(([a-z])\1{1,2})+$
Here I try to match any character that is repeated 1 or 2 times in the internal group, then I match that internal group if it's repeated multiple times. But it's not working that way.
Thanks.
To check that the string does not have a character (of any kind, even new line) repeated 3 times or more in a row:
/^(?!.*(.)\1{2})/s
You can also check that the input string does NOT have any match to this regex. In this case, you can also know the character being repeated 3 times or more in a row. Notice that this is exactly the same as above, except that the regex inside the negative look-ahead (?!pattern) is taken out.
/^.*(.)\1{2}/s
If you want to add validation that the string only contains characters from [a-z], and you consider aaA to be invalid:
/^(?!.*(.)\1{2})[a-z]+$/i
As you can see, the i flag (case-insensitive) affects how the captured text is compared against the current input.
Change + to * if you want to allow empty string to pass.
If you want to consider aaA to be valid, and you want to allow both upper and lower case:
/^(?!.*(.)\1{2})[A-Za-z]+$/
At first look, it might seem to be the same as the previous one, but since there is no i flag, the captured text is not subject to case-insensitive matching.
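The three variants above translate to Python roughly as follows. This is only a sketch: the /s and /i flags are written as re.DOTALL and re.IGNORECASE, the constant names are invented, and other engines may handle case-insensitive backreferences differently.

```python
import re

# No character (of any kind) three or more times in a row.
NO_TRIPLE = re.compile(r"^(?!.*(.)\1{2})", re.DOTALL)

# Letters only; with the i flag the backreference compares case-insensitively.
LETTERS_CI = re.compile(r"^(?!.*(.)\1{2})[a-z]+$", re.IGNORECASE)

# Letters only; without the i flag "a" and "A" count as different characters.
LETTERS_CS = re.compile(r"^(?!.*(.)\1{2})[A-Za-z]+$")

print(NO_TRIPLE.search("Aabcddee") is not None)  # True  (valid)
print(NO_TRIPLE.search("Aabcddde") is not None)  # False (ddd)
print(LETTERS_CI.search("aaA") is not None)      # False (aaA treated as a run of 3)
print(LETTERS_CS.search("aaA") is not None)      # True
```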
Below is a failed answer; you can ignore it, but you can read it for fun.
You can use this regex to check that the string does not contain the same character 3 times (any character, even a new line).
/^(?!.*(.)(?:.*\1){2})/s
You can also check that the input string does NOT have any match to this regex. In this case, you can also know the character being repeated more than or equal to 3 times. Notice that this is exactly the same as above, except that the regex inside the negative look-ahead (?!pattern) is taken out.
/^.*(.)(?:.*\1){2}/s
If you want to add validation that the string only contains characters from [a-z], and you consider aaA to be invalid:
/^(?!.*(.)(?:.*\1){2})[a-z]+$/i
As you can see, the i flag (case-insensitive) affects how the captured text is compared against the current input.
If you want to consider aaA to be valid, and you want to allow both upper and lower case:
/^(?!.*(.)(?:.*\1){2})[A-Za-z]+$/
At first look, it might seem to be the same as the previous one, but since there is no i flag, the captured text is not subject to case-insensitive matching.
From your question I get that you want to match
only strings consisting of chars from [A-Za-z] AND
only strings which have no sequence of the same character with a length of 3 or more
Then this regexp should work:
^(?:([A-Za-z])(?:(?!\1)|\1(?!\1)))+$
(Example in perl)
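The same pattern can also be checked in Python; this is a minimal sketch rather than a port of the Perl example, and the name NO_RUN_OF_3 is made up:

```python
import re

# Each letter is either not followed by itself, or followed by itself
# exactly once more (a run of two), never by a third copy.
NO_RUN_OF_3 = re.compile(r"^(?:([A-Za-z])(?:(?!\1)|\1(?!\1)))+$")

print(NO_RUN_OF_3.search("Aabcddee") is not None)  # True
print(NO_RUN_OF_3.search("Aabcddde") is not None)  # False
print(NO_RUN_OF_3.search("aab1") is not None)      # False (non-letter)
```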

How to write hive regex to match condition 1 OR condition 2 and return whichever matches?

I need to have "or" logic in my regexp.
For example, from "foobar435" I would need the three numbers, so "435"
But from "barfoo543" I would need the three letters before the three numbers, so "foo"
Individually, the regexes would be "foobar([0-9]){3}" to get the first case, and "[a-zA-Z]{3}([0-9]{3})[a-zA-Z]{3}" to get the second case. How do I get both cases at once with one regexp? So, if the first regexp matches then return "435", but if not, return "foo"?
I am using hive so ideally I want to make one call only. So far I have...
REGEXP_EXTRACT(myString, 'foobar([0-9]){3}', 1) AS columnName
Not sure how to add the second case into this. Thanks!
You can use lookarounds for this.
In your first case, you want to match three digits preceded by "foobar" (use lookbehind):
(?<=foobar)[0-9]{3}
In your second case, you want to match three letters preceded by three letters (use lookbehind) and followed by three digits (use lookahead):
(?<=[a-zA-Z]{3})[a-zA-Z]{3}(?=\d{3})
Note that, if I interpreted your requirements correctly, it looks like you flipped the numeric part with the second alpha part in your expression.
Now that you have your two expressions, you just need to combine them with an 'or':
(?<=foobar)[0-9]{3}|(?<=[a-zA-Z]{3})[a-zA-Z]{3}(?=\d{3})
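One thing to be aware of is that this will also match words with additional word characters on either end, e.g. "xfoobar435x". If this is undesirable, add a word boundary \b to the beginnings of the lookbehinds and to the end of the lookahead. Also note that a regex engine returns the leftmost match, so on "foobar435" the second alternative matches "bar" (three letters preceded by "foo" and followed by "435") before the first alternative gets a chance to match "435"; the second alternative has to exclude that case for the combination to return "435".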
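The combined expression is easy to check in Python's re module, used here purely for illustration (Hive's regex engine may differ in details). The sketch shows that the leftmost match on "foobar435" is "bar", and then tries one possible guard, which is my own assumption and not part of the answer above: a negative lookahead that stops the second alternative from starting at the "bar" of "foobar".

```python
import re

# The combined pattern exactly as given above.
COMBINED = re.compile(r"(?<=foobar)[0-9]{3}|(?<=[a-zA-Z]{3})[a-zA-Z]{3}(?=\d{3})")

# Hypothetical variant: (?!(?<=foo)bar) rejects the position where the
# three letters would be the "bar" of "foobar".
GUARDED = re.compile(
    r"(?<=foobar)[0-9]{3}|(?<=[a-zA-Z]{3})(?!(?<=foo)bar)[a-zA-Z]{3}(?=\d{3})"
)

print(COMBINED.search("barfoo543").group())  # foo
print(COMBINED.search("foobar435").group())  # bar (leftmost match wins)
print(GUARDED.search("foobar435").group())   # 435
print(GUARDED.search("barfoo543").group())   # foo
```

In Hive the whole match would be read as group 0 of REGEXP_EXTRACT, since neither pattern uses capturing groups.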