I am trying to figure out a couple regular expressions for the below cases:
Lines with a length divisible by n, but not by m for integers n and m
Lines that do not contain a certain number n of a given character,
but may contain more or less
I am a newcomer and would appreciate any clarification on these.
I've used JavaScript for my examples.
For the first one the key is to note that 'multiples' are just repeats. So using /(...)+/ will match 3 characters and then repeat that match as many times as it can. Each matching group doesn't need to be the same set of 3 characters but they do need to be consecutive. Suitable anchoring using ^ and $ ensures you're checking the exact string length and ?: can be used to negate.
e.g. Multiple of 5 but not 3:
/^(?!(.{3})+$)(.{5})+$/gm
Note that JavaScript uses / to mark the beginning and end of the expression and gm are modifiers to perform global, multiline matches. I wasn't clear what you meant by matching 'lines' so I've assumed the string itself contains newline characters that must be taken into consideration. If you had, say, and array of lines and could check each one individually then things get slightly easier, or a lot easier in the case of your second question.
Demo for the first question:
var chars = '12345678901234567890',
str = '';
for (var i = 1 ; i <= chars.length ; ++i) {
str += chars.slice(0, i) + '\n';
}
console.log('Full text is:');
console.log(str);
console.log('Lines with length that is a multiple of 2 but not 3:');
console.log(lineLength(str, 2, 3));
console.log('Lines with length that is a multiple of 3 but not 2:');
console.log(lineLength(str, 3, 2));
console.log('Lines with length that is a multiple of 5 but not 3:');
console.log(lineLength(str, 5, 3));
function lineLength(str, multiple, notMultiple) {
return str.match(new RegExp('^(?!(.{' + notMultiple + '})+$)(.{' + multiple + '})+$', 'gm'));
}
For the second question I couldn't come up with a nice way to do it. This horror show is what I ended up with. It wasn't too bad to match n occurrences of a particular character in a line but matching 'not n' proved difficult. I ended up matching {0,n-1} or {n+1,} but the whole thing doesn't feel so great to me. I suspect there's a cleverer way to do it that I'm currently not seeing.
var str = 'a\naa\naaa\naaaa\nab\nabab\nababab\nabababab\nba\nbab\nbaba\nbbabbabb';
console.log('Full string:');
console.log(str);
console.log('Lines with 1 occurrence of a:');
console.log(mOccurrences(str, 'a', 1));
console.log('Lines with 2 occurrences of a:');
console.log(mOccurrences(str, 'a', 2));
console.log('Lines with 3 occurrences of a:');
console.log(mOccurrences(str, 'a', 3));
console.log('Lines with not 1 occurrence of a:');
console.log(notMOccurrences(str, 'a', 1));
console.log('Lines with not 2 occurrences of a:');
console.log(notMOccurrences(str, 'a', 2));
console.log('Lines with not 3 occurrences of a:');
console.log(notMOccurrences(str, 'a', 3));
function mOccurrences(str, character, m) {
return str.match(new RegExp('^[^' + character + '\n]*(' + character + '[^' + character + '\n]*){' + m + '}[^' + character + '\n]*$', 'gm'));
}
function notMOccurrences(str, character, m) {
return str.match(new RegExp('^([^' + character + '\n]*(' + character + '[^' + character + '\n]*){0,' + (m - 1) + '}[^' + character + '\n]*|[^' + character + '\n]*(' + character + '[^' + character + '\n]*){' + (m + 1) + ',}[^' + character + '\n]*)$', 'gm'));
}
The key to how that works is that it tries to find n occurrences of a separated by sequences of [^a], with \n thrown in to stop it walking onto the next line.
In a real world scenario I would probably do the splitting into lines first as that makes things much easier. Counting the number of occurrences of a particular character is then just:
str.replace(/[^a]/g, '').length;
// Could use this instead but note in JS it'd fail if length is 0
str.match(/a/g, '').length;
Again, this assumes a JavaScript environment. If you were using regexes in an environment where you could literally only pass in the regex as an argument then it's back to my earlier horror show.
Related
I have a list of dictionary words, I would like to find any word that consists of (some or all) certain characters of a source word in any order :
For Example:
Characters (source word) to look for : stainless
Found Words : stainless, stain, net, ten, less, sail, sale, tale, tales, ants, etc.
Also if a letter is found once in the source word it can't be repeated in the found word
Unacceptable words to find : tent (t is repeated), tall (l is repeated) , etc.
Acceptable words to find : less (s is already repeated in the source word), etc.
You could take this approach:
Match any sequence of characters that are in the search word, requiring that the match is a word (word-boundaries)
Prohibit that a certain character occurs more often than it is present in the search word, using a negative look-ahead. Do this for every character that is in the search word.
For the given example the regular expression would be:
(?!(\S*s){4}|(\S*t){2}|(\S*a){2}|(\S*i){2}|(\S*n){2}|(\S*l){2}|(\S*e){2})\b[stainless]+\b
The biggest part of the pattern deals with the negative look-ahead. For example:
(\S*s){4} would match four times an 's' in a single word.
(?! | ) places these patterns as different options in a negative look-ahead so that none of them should match.
Automation
It is clear that making such a regular expression for a given word needs some work, so that is where you could use some automation. Notepad++ cannot help with that, but in a programming environment it is possible. Here is a little snippet in JavaScript that will give you the regular expression that corresponds to a given search word:
function regClassEscape(s) {
// Escape "[" and "^" and "-":
return s.replace(/[\]^-]/g, "\\$&");
}
function buildRegex(searchWord) {
// get frequency of each letter:
let freq = {};
for (let ch of searchWord) {
ch = regClassEscape(ch);
freq[ch] = (freq[ch] ?? 0) + 1;
}
// Produce negative options (too many occurrences)
const forbidden = Object.entries(freq).map(([ch, count]) =>
"(\\S*[" + ch + "]){" + (count + 1) + "}"
).join("|");
// Produce character set
const allowed = Object.keys(freq).join("");
return "(?!" + forbidden + ")\\b[" + allowed + "]+\\b";
}
// I/O management
const [input, output] = document.querySelectorAll("input,div");
input.addEventListener("input", refresh);
function refresh() {
if (/\s/.test(input.value)) {
output.textContent = "Input should have no white space!";
} else {
output.textContent = buildRegex(input.value);
}
}
refresh();
input { width: 100% }
Search word:<br>
<input value="stainless">
Regular expression:
<div></div>
I continue trying to perform string format matching using RegExp in VBScript & VB6. I am now trying to match a short, single-line string formatted as:
Seven characters:
a. Six alphanumeric plus one "-" OR
b. Five alphanumeric plus two "-"
Three numbers
Two letters
Literal "65"
A two-digit hex number.
Examples include 123456-789LM65F2, 4EF789-012XY65A5, A2345--789AB65D0 & 23456--890JK65D0.
The RegExp pattern ([A-Z0-9\-]{12})([65][A-F0-9]{2}) lumps (1) - (3) together and finds these OK.
However, if I try to:
c) Break (3) out w/ pattern ([A-Z0-9\-]{10})([A-Z]{2})([65][A-F0-9]{2}),
d) Break out both (2) & (3) w/ pattern ([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2}), or
e) Tighten up (1) with alternation pattern ([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})
it refuses to find any of them.
What am I doing wrong? Following is a VBScript that runs and checks these.
' VB Script
Main()
Function Main() ' RegEx_Format_sample.vbs
'Uses two paterns, TestPttn for full format accuracy check & SplitPttn
'to separate the two desired pieces
Dim reSet, EtchTemp, arrSplit, sTemp
Dim sBoule, sSlice, idx, TestPttn, SplitPttn, arrMatch
Dim arrPttn(3), arrItems(3), idxItem, idxPttn, Msgtemp
Set reSet = New RegExp
' reSet.IgnoreCase = True ' Not using
' reSet.Global = True ' Not using
' load test case formats to check & split
arrItems(0) = "0,6 nums + 1 '-',123456-789LM65F2"
arrItems(1) = "1,6 chars + 1 '-',4EF789-012XY65A5"
arrItems(2) = "2,5 chars + 2 '-',A2345--789AB65D0"
arrItems(3) = "3,5 nums + 2 '-',23456--890JK65D0"
SplitPttn = "([A-Z0-9]{5,6})[-]{1,2}([A-Z0-9]{9})" ' split pattern has never failed to work
' load the patterns to try
arrPttn(0) = "([A-Z0-9\-]{12})([65][A-F0-9]{2})"
arrPttn(1) = "([A-Z0-9\-]{10}[A-Z]{2})([65][A-F0-9]{2})"
arrPttn(2) = "([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})"
arrPttn(3) = "([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})"
For idxPttn = 0 To 3 ' select Test pattern
TestPttn = arrPttn(idxPttn)
TestPttn = TestPttn & "[%]" ' append % "ender" char
SplitPttn = SplitPttn & "[%]" ' append % "ender" char
For idxItem = 0 To 3
reSet.Pattern = TestPttn ' set to Test pattern
sTemp = arrItems(idxItem )
arrSplit = Split(sTemp, ",") ' arrSplit is Split array
EtchTemp = arrSplit(2) & "%" ' append % "ender" char to Item sub (2) as the "phrase" under test
If reSet.Test(EtchTemp) = False Then
MsgBox("RegEx " & TestPttn & " false for " & EtchTemp & " as " & arrSplit(1) )
Else ' test OK; now switch to SplitPttn
reSet.Pattern = SplitPttn
Set arrMatch = reSet.Execute(EtchTemp) ' run Pttn as Exec this time
If arrMatch.Count > 0 then ' If test OK then Count s/b > 0
Msgtemp = ""
Msgtemp = "RegEx " & TestPttn & " TRUE for " & EtchTemp & " as " & arrSplit(1)
For idx = 0 To arrMatch.Item(0).Submatches.Count - 1
Msgtemp = Msgtemp & Chr(13) & Chr(10) & "Split segment " & idx & " as " & arrMatch.Item(0).submatches.Item(idx)
Next
MsgBox(Msgtemp)
End If ' Count OK
End If ' test OK
Next ' idxItem
Next ' idxPttn
End Function
Try this Regex:
(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--)[0-9]{3}[A-Z]{2}65[0-9A-F]{2}
Click for Demo
Explanation:
(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--) - matches either 6 Alphanumeric characters followed by a - or 5 Alphanumeric characters followed by a --
[0-9]{3} - matches 3 Digits
[A-Z]{2} - matches 2 Letters
65 - matches 65 literally
[0-9A-F]{2} - matches 2 HEX symbols
You can get some idea from the following code:
VBScript Code:
Option Explicit
Dim objReg, strTest
strTest = "123456-789LM65F2" 'Change the value as per your requirements. You can also store a list of values in an array and run the code in loop
set objReg = new RegExp
objReg.Global = True
objReg.IgnoreCase = True
objReg.Pattern = "(?:[A-Z0-9]{6}-|[A-Z0-9]{5}--)[0-9]{3}[A-Z]{2}65[0-9A-F]{2}"
if objReg.test(strTest) then
msgbox strTest&" matches with the Pattern"
else
msgbox strTest&" does not match with the Pattern"
end if
set objReg = Nothing
Your patterns do not work because:
([A-Z0-9\-]{12})([65][A-F0-9]{2}) - matches 12 occurrences of either an AlphaNumeric character or - followed by either 6 or 5 followed by 2 HEX characters
([A-Z0-9\-]{10}[A-Z]{2})([65][A-F0-9]{2}) - matches 10 occurrences of either an AlphaNumeric character or - followed by 2 Letters followed by either 6 or 5 followed by 2 HEX characters
([A-Z0-9\-]{7})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2}) - matches 7 occurrences of either an AlphaNumeric character or - followed by 3 digits followed by 2 Letters followed by either 6 or 5 followed by 2 HEX characters
([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2}) - matches either 5 occurrences of an AlphaNumeric character followed by -- or 6 occurrences of an Alphanumeric followed by a -. This is then followed by 3 digits followed by 2 Letters followed by either 6 or 5 followed by 2 HEX characters
Try this pattern :
(([A-Z0-9]{5}--)|([A-Z0-9]{6}-))[0-9]{3}[A-Z]{2}65[0-9A-F]{2}
Or, if the last part doesn't like the [A-F]
(([A-Z0-9]{5}--)|([A-Z0-9]{6}-))[0-9]{3}[A-Z]{2}65[0-9ABCDEF]{2}
All, tanx again for your help!!
trincot, everything in each arrItems() between the commas, incl the the "plus", is merely part of a shorthand description of each item's characteristics, such as "5 characters plus 2 dashes".
Gurman, your pttn breakdowns were helpful, but, if I read it right, the addition of the ? prefix is a "Match zero or one occurrences" and this must match exactly one occurrence. Also, my 1st pattern (matches 12) actually DID work for all my test cases.
jNevill, & JMichelB your suggestions are very close to what I ended up with.
I was "over-classing". After some tinkering, I was able to get the Test Pttn to successfully recognize these test cases by taking the [65] out of the [] in my original Alternation pattern. That is I went from ([65]) to (65) and Zammo! it worked.
Orig pattern:
([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})([65][A-F0-9]{2})
Wkg pattern:
([A-Z0-9]{5}[-]{2}|[A-Z0-9]{6}[-]{1})([0-9]{3})([A-Z]{2})(65)([A-F0-9]{2})
Oh, and I moved the
SplitPttn = SplitPttn & "[%]" ' append % "ender" char
stmt up out of the For...Next loop. That helped w/ the splitting.
T-Bone
I want to match the expression of var * var, var * num, num * var and num * num separately, i.e. using four different regular expression.
my var could be s1,s2,...,S1,S2,...,v1,v2,...V1,V2....
my num could be any float number
for var*var, I use:
[vVsS][0-9]+\s*[*/]\s*[vVsS][0-9]+
and it works well
for var*num and num*var, I use:
[vVsS][0-9]+\s*[*/]\s*[0-9]+[.]?[0-9]*
and
[0-9]+[.]?[0-9]*\s*[*/]\s*(vVsS)[0-9]+
but it returns nothing when I try the input:
2*4 + s1* 7 + v3 * 2 + s3 * V2 + 5*v1
UPDATE: I could do that now.
For example, for the case of var * num
[vVsS][0-9]+\s*[*/]\s*[0-9]+(?:[.][0-9]+)? works well, as Wiktor Stribiżew suggests in comment.
But I didn't find some explanation on the use of(?:) online. Anyone has idea on that?
You may use
[vVsS][0-9]+\s*[*/]\s*[0-9]+(?:[.][0-9]+)?
The pattern matches:
[vVsS][0-9]+ - a letter from the character class (either v, V, s or S) followed with one or more digits
\s*[*/]\s* - a / or * enclosed with zero or more whitespaces
[0-9]+ - one or more digits
(?:[.][0-9]+)? - an optional non-capturing group matching a dot and one or more digits.
How to match any character which repeats n times?
Example:
for input: abcdbcdcdd
for n=1: ..........
for n=2: .........
for n=3: .. .....
for n=4: . . ..
for n=5: no matches
After several hours my best is this expression
(\w)(?=(?:.*\1){n-1,}) //where n is variable
which uses lookahead. However the problem with this expression is this:
for input: abcdbcdcdd
for n=1 ..........
for n=2 ... .. .
for n=3 .. .
for n=4 .
for n=5 no matches
As you can see, when lookahead matches for a character, let's look for n=4 line, d's lookahead assertion satisfied and first d matched by regex. But remaining d's are not matched because they don't have 3 more d's ahead of them.
I hope I stated the problem clearly. Hoping for your solutions, thanks in advance.
let's look for n=4 line, d's lookahead assertion satisfied
and first d matched by regex.
But remaining d's are not matched because they don't have 3 more d's
ahead of them.
And obviously, without regex, this is a very simple string manipulation
problem. I'm trying to do this with and only with regex.
As with any regex implementation, the answer depends on the regex flavour. You could create a solution with .net regex engine, because it allows variable width lookbehinds.
Also, I'll provide a more generalized solution below for perl-compatible/like regex flavours.
.net Solution
As #PetSerAl pointed out in his answer, with variable width lookbehinds, we can assert back to the beggining of the string, and check there are n occurrences.
ideone demo
regex module in Python
You can implement this solution in python, using the regex module by Matthew Barnett, which also allows variable-width lookbehinds.
>>> import regex
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){2})\A.*)', 'abcdbcdcdd')
['b', 'c', 'd', 'b', 'c', 'd', 'c', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){3})\A.*)', 'abcdbcdcdd')
['c', 'd', 'c', 'd', 'c', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){4})\A.*)', 'abcdbcdcdd')
['d', 'd', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){5})\A.*)', 'abcdbcdcdd')
[]
Generalized Solution
In pcre or any of the "perl-like" flavours, there is no solution that would actually return a match for every repeated character, but we could create one, and only one, capture for each character.
Strategy
For any given n, the logic involves:
Early matches: Match and capture every character followed by at least n more occurences.
Final captures:
Match and capture a character followed by exactly n-1 occurences, and
also capture every one of the following occurrences.
Example
for n = 3
input = abcdbcdcdd
The character c is Matched only once (as final), and the following 2 occurrences are also Captured in the same match:
abcdbcdcdd
M C C
and the character d is (early) Matched once:
abcdbcdcdd
M
and (finally) Matched one more time, Capturing the rest:
abcdbcdcdd
M CC
Regex
/(\w) # match 1 character
(?:
(?=(?:.*?\1){≪N≫}) # [1] followed by other ≪N≫ occurrences
| # OR
(?= # [2] followed by:
(?:(?!\1).)*(\1) # 2nd occurence <captured>
(?:(?!\1).)*(\1) # 3rd occurence <captured>
≪repeat previous≫ # repeat subpattern (n-1) times
# *exactly (n-1) times*
(?!.*?\1) # not followed by another occurence
)
)/xg
For n =
/(\w)(?:(?=(?:.*?\1){2})|(?=(?:(?!\1).)*(\1)(?!.*?\1)))/g
demo
/(\w)(?:(?=(?:.*?\1){3})|(?=(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?!.*?\1)))/g
demo
/(\w)(?:(?=(?:.*?\1){4})|(?=(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?!.*?\1)))/g
demo
... etc.
Pseudocode to generate the pattern
// Variables: N (int)
character = "(\w)"
early_match = "(?=(?:.*?\1){" + N + "})"
final_match = "(?="
for i = 1; i < N; i++
final_match += "(?:(?!\1).)*(\1)"
final_match += "(?!.*?\1))"
pattern = character + "(?:" + early_match + "|" + final_match + ")"
JavaScript Code
I'll show an implementation using javascript because we can check the result here (and if it works in javascript, it works in any perl-compatible regex flavour, including .net, java, python, ruby, perl, and all languages that implemented pcre, among others).
var str = 'abcdbcdcdd';
var pattern, re, match, N, i;
var output = "";
// We'll show the results for N = 2, 3 and 4
for (N = 2; N <= 4; N++) {
// Generate pattern
pattern = "(\\w)(?:(?=(?:.*?\\1){" + N + "})|(?=";
for (i = 1; i < N; i++) {
pattern += "(?:(?!\\1).)*(\\1)";
}
pattern += "(?!.*?\\1)))";
re = new RegExp(pattern, "g");
output += "<h3>N = " + N + "</h3><pre>Pattern: " + pattern + "\nText: " + str;
// Loop all matches
while ((match = re.exec(str)) !== null) {
output += "\nPos: " + match.index + "\tMatch:";
// Loop all captures
x = 1;
while (match[x] != null) {
output += " " + match[x];
x++;
}
}
output += "</pre>";
}
document.write(output);
Python3 code
As requested by the OP, I'm linking to a Python3 implementation in ideone.com
Regular expressions (and finite automata) are not able to count to arbitrary integers. They can only count to a predefined integer and fortunately this is your case.
Solving this problem is much easier if we first construct a nondeterministic finite automata (NFA) ad then convert it to regular expression.
So the following automata for n=2 and input alphabet = {a,b,c,d}
will match any string that has exactly 2 repetitions of any char. If no character has 2 repetitions (all chars appear less or more that two times) the string will not match.
Converting it to regex should look like
"^([^a]*a[^a]*a[^a]*)|([^b]*b[^b]*b[^b]*)|([^b]*c[^c]*c[^C]*)|([^d]*d[^d]*d[^d]*)$"
This can get problematic if the input alphabet is big, so that regex should be shortened somehow, but I can't think of it right now.
With .NET regular expressions you can do following:
(\w)(?<=(?=(?:.*\1){n})^.*) where n is variable
Where:
(\w) — any character, captured in first group.
(?<=^.*) — lookbehind assertion, which return us to the start of the string.
(?=(?:.*\1){n}) — lookahead assertion, to see if string have n instances of that character.
Demo
I would not use regular expressions for this. I would use a scripting language such as python. Try out this python function:
alpha = 'abcdefghijklmnopqrstuvwxyz'
def get_matched_chars(n, s):
s = s.lower()
return [char for char in alpha if s.count(char) == n]
The function will return a list of characters, all of which appear in the string s exactly n times. Keep in mind that I only included letters in my alphabet. You can change alpha to represent anything that you want to get matched.
I have to process a comma separated string which contains triplets of values and translate them to runtime types,the input looks like:
"1x2y3z,80r160g255b,48h30m50s,1x3z,255b,1h,..."
So each substring should be transformed this way:
"1x2y3z" should become Vector3 with x = 1, y = 2, z = 3
"80r160g255b" should become Color with r = 80, g = 160, b = 255
"48h30m50s" should become Time with h = 48, m = 30, s = 50
The problem I'm facing is that all the components are optional (but they preserve order) so the following strings are also valid Vector3, Color and Time values:
"1x3z" Vector3 x = 1, y = 0, z = 3
"255b" Color r = 0, g = 0, b = 255
"1h" Time h = 1, m = 0, s = 0
What I have tried so far?
All components optional
((?:\d+A)?(?:\d+B)?(?:\d+C)?)
The A, B and C are replaced with the correct letter for each case, the expression works almost well but it gives twice the expected results (one match for the string and another match for an empty string just after the first match), for example:
"1h1m1s" two matches [1]: "1h1m1s" [2]: ""
"11x50z" two matches [1]: "11x50z" [2]: ""
"11111h" two matches [1]: "11111h" [2]: ""
This isn't unexpected... after all an empty string matches the expression when ALL of the components are empty; so in order to fix this issue I've tried the following:
1 to 3 quantifier
((?:\d+[ABC]){1,3})
But now, the expression matches strings with wrong ordering or even repeated components!:
"1s1m1h" one match, should not match at all! (wrong order)
"11z50z" one match, should not match at all! (repeated components)
"1r1r1b" one match, should not match at all! (repeated components)
As for my last attempt, I've tried this variant of my first expression:
Match from begin ^ to the end $
^((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
And it works better than the first version but it still matches the empty string plus I should first tokenize the input and then pass each token to the expression in order to assure that the test string could match the begin (^) and end ($) operators.
EDIT: Lookahead attempt (thanks to Casimir et Hippolyte)
After reading and (try to) understanding the regex lookahead concept and with the help of Casimir et Hippolyte answer I've tried the suggested expression:
\b(?=[^,])(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Against the following test string:
"48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h"
And the results were amazing! it is able to detect complete valid matches flawlessly (other expressions gave me 3 matches on "1s1m1h" or "1h1h1h" which weren't intended to be matched at all). Unfortunately it captures emtpy matches everytime a unvalid match is found so a "" is detected just before "1s1m1h", "1h1h1h", "adfank" and "12322134445688", so I modified the Lookahead condition to get the expression below:
\b(?=(?:\d+[ABC]){1,3})(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
It gets rid of the empty matches in any string which doesn't match (?:\d+[ABC]){1,3}) so the empty matches just before "adfank" and "12322134445688" are gone but the ones just before "1s1m1h", "1h1h1h" are stil detected.
So the question is: Is there any regular expression which matches three triplet values in a given order where all component is optional but should be composed of at least one component and doesn't match empty strings?
The regex tool I'm using is the C++11 one.
Yes, you can add a lookahead at the begining to ensure there is at least one character:
^(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
If you need to find this kind of substring in a larger string (so without to tokenize before), you can remove the anchors and use a more explicit subpattern in a lookahead:
(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)
In this case, to avoid false positive (since you are looking for very small strings that can be a part of something else), you can add word-boundaries to the pattern:
\b(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Note: in a comma delimited string: (?=\d+[ABC]) can be replaced by (?=[^,])
I think this might do the trick.
I am keying on either the beginning of the string to match ^ or the comma separator , for fix the start of each match: (?:^|,).
Example:
#include <regex>
#include <iostream>
const std::regex r(R"~((?:^|,)((?:\d+[xrh])?(?:\d+[ygm])?(?:\d+[zbs])?))~");
int main()
{
std::string test = "1x2y3z,80r160g255b,48h30m50s,1x3z,255b";
std::sregex_iterator iter(test.begin(), test.end(), r);
std::sregex_iterator end_iter;
for(; iter != end_iter; ++iter)
std::cout << iter->str(1) << '\n';
}
Output:
1x2y3z
80r160g255b
48h30m50s
1x3z
255b
Is that what you are after?
EDIT:
If you really want to go to town and make empty expressions unmatched then as far as I can tell you have to put in every permutation like this:
const std::string A = "(?:\\d+[xrh])";
const std::string B = "(?:\\d+[ygm])";
const std::string C = "(?:\\d+[zbs])";
const std::regex r("(?:^|,)(" + A + B + C + "|" + A + B + "|" + A + C + "|" + B + C + "|" + A + "|" + B + "|" + C + ")");