Regex end of match with multiple OR conditions - regex

I want to seach all matches for a product reference with RegEx. The product references allways stars with "XXV-" followed by a number and in some cases with one or two chars. Some examples: "XXV-1234-AB", "XXV-1232", "XXV-12-X". The only anchor for the match is the start.
In this sample:
<div>Reference to search is XXV-1234-BH</div>
<div>Other reference to search is XXV-1235-VC
</div>
<div>And XXV-1236-HG
also</div>
I need to math all tree codes, better with a single regular expression.
I try with:
(?=XXV-).*?(?=<) that only matches the first.
(?=XXV-).*?(?=\n) matches all, but with < /div> of the first
It's posible to match "<" or "\n"?
Thanks for your time!

Why is your question about 'matching "<" or "\n"'? You know exactly what the format should be, so just build a regex for it.
The product references always stars with "XXV-" followed by a number and in some cases with one or two chars.
This can be matched by:
XXV-\d+(-[A-Z]{1,2})?
[A-Z] means "any character between A and Z". If you want to include other characters, e.g. lower-case, you could use: [a-zA-Z].
{1,2} means 1-2 of the previous pattern (in this case, [A-Z]).
? makes the capture group optional.
\d means any digit, i.e. it is equivalent to [0-9].
Here is a demo, with some test cases

Related

Is there a way to use periodicity in a regular expression?

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, That was a good movie. should result to That was, was a, a good, good movie.
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of the \K feature that removes from the match everything that has been captured from the regex up to that point, and starts a new match that will be returned. So essentially, we do: match a word, match separators, match another word, forget everything, match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
Also note that, the pattern \w selects only Latin characters, so if your documents contain text in a different character set, these characters will be consumed by the \W which is supposed to match the separators. If you want to capture text with non-Latin character sets, like Greek for example, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text on one language and not on another language, you can modify it accordingly, by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you will use "{Greek}", or "{InGreek}" which is what RapidMiner supports.
What you can do is use a zero width group (like a positive look-ahead, as shown in example). Regex usually "consumes" characters it checks, but with a positive lookahead/lookbehind, you assert that characters exist without preventing further checks from checking those letters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
The following pattern matches for each pair of two words (note that it won't match the last word since it does not have a pair). The first word is in the first capture group, (\w+). Then a positive lookahead includes a match for a sequence of non word characters \W+ and then another string of word characters \w+. The lookahead (?=...) the second word is not "consumed".
Here is a link to a demo on Regex101
Note that for each match, each word is in its own capture group (group 1, group 2)
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)) inspired from this SO question.
My question sounds wrong once you understand that is a problem of an overlapping regex pattern.

Cleaning up a regular expression which has lots of repetition

I am looking to clean up a regular expression which matches 2 or more characters at a time in a sequence. I have made one which works, but I was looking for something shorter, if possible.
Currently, it looks like this for every character that I want to search for:
([A]{2,}|[B]{2,}|[C]{2,}|[D]{2,}|[E]{2,}|...)*
Example input:
AABBBBBBCCCCAAAAAADD
See this question, which I think was asking the same thing you are asking. You want to write a regex that will match 2 or more of the same character. Let's say the characters you are looking for are just capital letters, [A-Z]. You can do this by matching one character in that set and grouping it by putting it in parentheses, then matching that group using the reference \1 and saying you want two or more of that "group" (which is really just the one character that it matched).
([A-Z])\1{1,}
The reason it's {1,} and not {2,} is that the first character was already matched by the set [A-Z].
Not sure I understand your needs but, how about:
[A-E]{2,}
This is the same as yours but shorter.
But if you want multiple occurrences of each letter:
(?:([A-Z])\1+)+
where ([A-Z]) matches one capital letter and store it in group 1
\1 is a backreference that repeats group 1
+ assume that are one or more repetition
Finally it matches strings like the one you've given: AABBBBBBCCCCAAAAAADD
To be sure there're no other characters in the string, you have to anchor the regex:
^(?:([A-Z])\1+)+$
And, if you wnat to match case insensitive:
^(?i)(?:([A-Z])\1+)+$

match the same unknown character multiple times

I have a regex problem I can't seem to solve. I actually don't know if regex can do this, but I need to match a range of characters n times at the end of a pattern.
eg. blahblah[A-Z]{n}
The problem is whatever character matches the ending range need to be all the same.
For example, I want to match
blahblahAAAAA
blahblahEEEEE
blahblahQQQQQ
but not
blahblahADFES
blahblahZYYYY
Is there some regex pattern that can do this?
You can use this pattern: blahblah([A-Z])\1+
The \1 is a back-reference to the first capture group, in this case ([A-Z]). And the + will match that character one or more times. To limit it you can replace the + with a specific number of repetitions using {n}, such as \1{3} which will match it three times.
If you need the entire string to match then be sure to prefix with ^ and end with $, respectively, so that the pattern becomes ^blahblah([A-Z])\1+$
You can read more about back-references here.
In most regex implementations, you can accomplish this by referencing a capture group in your regex. For your example, you can use the following to match the same uppercase character five times:
blahblah([A-Z])\1{4}
Note that to match the regex n times, you need to use \1{n-1} since one match will come from the capture group.
blahblah(.)\1*\b should work in nearly all language flavors. (.) captures one of anything, then \1* matches that (the first match) any number of times.
blahblah([A-Z]|[a-z])\1+
This should help.

How to write hive regex to match condition 1 OR condition 2 and return whichever matches?

I need to have "or" logic in my regexp.
For example, from "foobar435" I would need the three numbers, so "435"
But from "barfoo543" I would need the three letters before the three numbers, so "foo"
Individually, the regexes would be "foobar([0-9]){3}" to get the first case, and "[a-zA-Z]{3}([0-9]{3})[a-zA-Z]{3}" to get the second case. How do I get both cases at once with one regexp? So, if the first regexp matches then return "435", but if not, return "foo"?
I am using hive so ideally I want to make one call only. So far I have...
REGEXP_EXTRACT(myString, 'foobar([0-9]){3}', 1) AS columnName
Not sure how to add the second case into this. Thanks!
You can use lookarounds for this.
In your first case, you want to match three digits preceded by "foobar" (use lookbehind):
(?<=foobar)[0-9]{3}
In your second case, you want to match three letters preceded by three letters (use lookbehind) and followed by three digits (use lookahead):
(?<=[a-zA-Z]{3})[a-zA-Z]{3}(?=\d{3})
Note that, if I interpreted your requirements correctly, it looks like you flipped the numeric part with the second alpha part in your expression.
Now that you have your two expressions, you just need to combine them with an 'or':
(?<=foobar)[0-9]{3}|(?<=[a-zA-Z]{3})[a-zA-Z]{3}(?=\d{3})
One thing to be aware of is that this will also match words with additional word characters on either end, ie "xfoobar435x". If this is undesirable, add a word boundary \b to the beginnings of the lookbehinds and to the end of the lookahead.

Regex to validate number and letter sequence

I want a regex to validate inputs of the form AABBAAA, where A is a a letter (a-z, A-Z) and B is a digit (0-9). All the As must be the same, and so must the Bs.
If all the A's and B's are supposed to be the same, I think the only way to do it would be:
([a-zA-Z])\1([0-9])\2\1\1\1
Where \1 and \2 refer to the first and second parenthetical groupings. However, I don't think all regex engines support this.
It's really not as hard as you think; you've got most of the syntax already.
[a-zA-Z]{2}[0-9]{2}[a-zA-Z]{3}
The numbers in braces ({}) tell how many times to match the previous character or set of characters, so that matches [a-zA-Z] twice, [0-9] twice, and [a-zA-Z] three times.
Edit: If you want to make sure the matched string is not part of a longer string, you can use word boundaries; just add \b to each end of the regex:
\b[a-zA-Z]{2}[0-9]{2}[a-zA-Z]{3}\b
Now "Ab12Cde" will match but "YZAb12Cdefg" will not.
Edit 2: Now that the question has changed, backreferences are the only way to do it. edsmilde's answer should work; however, you may need to add the word boundaries to get your final solution.
\b([a-zA-Z])\1([0-9])\2\1\1\1\b
[a-zA-Z]{2}\d{2}[a-zA-Z]{3}