Matching token sequences - regex

I have a set of n tokens (e.g., a, b, c) distributed among a bunch of other tokens. I would like to know if all members of my set occur within a given number of positions (window size). It occurred to me that it may be possible to write a RegEx to capture this state, but the exact syntax eludes me.
11111
012345678901234
ab ab bc a cba
In this example, given window size=5, I would like to match cba at positions 12-14, and abc in positions 3-7.
Is there a way to do this with RegEx, or is there some other kind of grammar that I can use to capture this logic?
I am hoping to implement this in Java.

Here's a regex that matches 5-letter sequences that include all of 'a', 'b' and 'c':
(?=.{0,4}a)(?=.{0,4}b)(?=.{0,4}c).{5}
So, while basically matching any 5 characters (with .{5}), there are three preconditions the matches have to observe. Each of them requires one of the tokens/letters to be present (up to 4 characters followed by 'a', etc.). (?=X) matches "X, with a zero-width positive look-ahead", where zero-width means that the character position is not moved while matching.
Doing this with regexes is slow, though.. Here's a more direct version (seems about 15x faster than using regular expressions):
public static void find(String haystack, String tokens, int windowLen) {
char[] tokenChars = tokens.toCharArray();
int hayLen = haystack.length();
int pos = 0;
nextPos:
while (pos + windowLen <= hayLen) {
for (char c : tokenChars) {
int i = haystack.indexOf(c, pos);
if (i < 0) return;
if (i - pos >= windowLen) {
pos = i - windowLen + 1;
continue nextPos;
}
}
// match found at pos
System.out.println(pos + ".." + (pos + windowLen - 1) + ": " + haystack.substring(pos, pos + windowLen));
pos++;
}
}

This tested Java program has a commented regex which does the trick:
import java.util.regex.*;
public class TEST {
public static void main(String[] args) {
String s = "ab ab bc a cba";
Pattern p = Pattern.compile(
"# Match 5 char sequences containing: a and b and c\n" +
"(?=[abc]) # Assert first char is a, b or c.\n" +
"(?=.{0,4}a) # Assert an 'a' within 5 chars.\n" +
"(?=.{0,4}b) # Assert an 'b' within 5 chars.\n" +
"(?=.{0,4}c) # Assert an 'c' within 5 chars.\n" +
".{5} # If so, match the 5 chers.",
Pattern.COMMENTS);
Matcher m = p.matcher(s);
while (m.find()) {
System.out.print("Match = \""+ m.group() +"\"\n");
}
}
}
Note that there is another valid sequence S9:13" a cb" in your test data (before the S12:14"cba". Assuming you did not want to match this one, I added an additional constraint to filter it out, which requires that the 5 char window must begin with an a, b or c.
Here is the output from the script:
Match = "ab bc"
Match = "a cba"

Well, one possibility (albeit a completely impractical one) is simply to match against all permutations:
abc..|ab.c.|ab..c| .... etc.
This can be factorised somewhat:
ab(c..|.c.|..c)|a.(bc.|b.c .... etc.
I'm not sure if you can do better with regex.

Pattern p = Pattern.compile("(?:a()|b()|c()|.){5}\\1\\2\\3");
String s = "ab ab bc a cba";
Matcher m = p.matcher(s);
while (m.find())
{
System.out.println(m.group());
}
output:
ab bc
a cb
This is inspired by Recipe #5.7 in Regular Expressions Cookbook. Each back-reference (\1, \2, \3) acts like a zero-width assertion, indicating that the corresponding capturing group participated in the match, even though the group itself didn't consume any characters.
The authors warn that this trick relies on behavior that's undocumented in most flavors. It works in Java, .NET, Perl, PHP, Python and Ruby (original and Oniguruma), but not in JavaScript or ActionScript.

Related

REGEX - Matching any character which repeats n times

How to match any character which repeats n times?
Example:
for input: abcdbcdcdd
for n=1: ..........
for n=2: .........
for n=3: .. .....
for n=4: . . ..
for n=5: no matches
After several hours my best is this expression
(\w)(?=(?:.*\1){n-1,}) //where n is variable
which uses lookahead. However the problem with this expression is this:
for input: abcdbcdcdd
for n=1 ..........
for n=2 ... .. .
for n=3 .. .
for n=4 .
for n=5 no matches
As you can see, when lookahead matches for a character, let's look for n=4 line, d's lookahead assertion satisfied and first d matched by regex. But remaining d's are not matched because they don't have 3 more d's ahead of them.
I hope I stated the problem clearly. Hoping for your solutions, thanks in advance.
let's look for n=4 line, d's lookahead assertion satisfied
and first d matched by regex.
But remaining d's are not matched because they don't have 3 more d's
ahead of them.
And obviously, without regex, this is a very simple string manipulation
problem. I'm trying to do this with and only with regex.
As with any regex implementation, the answer depends on the regex flavour. You could create a solution with .net regex engine, because it allows variable width lookbehinds.
Also, I'll provide a more generalized solution below for perl-compatible/like regex flavours.
.net Solution
As #PetSerAl pointed out in his answer, with variable width lookbehinds, we can assert back to the beggining of the string, and check there are n occurrences.
ideone demo
regex module in Python
You can implement this solution in python, using the regex module by Matthew Barnett, which also allows variable-width lookbehinds.
>>> import regex
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){2})\A.*)', 'abcdbcdcdd')
['b', 'c', 'd', 'b', 'c', 'd', 'c', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){3})\A.*)', 'abcdbcdcdd')
['c', 'd', 'c', 'd', 'c', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){4})\A.*)', 'abcdbcdcdd')
['d', 'd', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){5})\A.*)', 'abcdbcdcdd')
[]
Generalized Solution
In pcre or any of the "perl-like" flavours, there is no solution that would actually return a match for every repeated character, but we could create one, and only one, capture for each character.
Strategy
For any given n, the logic involves:
Early matches: Match and capture every character followed by at least n more occurences.
Final captures:
Match and capture a character followed by exactly n-1 occurences, and
also capture every one of the following occurrences.
Example
for n = 3
input = abcdbcdcdd
The character c is Matched only once (as final), and the following 2 occurrences are also Captured in the same match:
abcdbcdcdd
M C C
and the character d is (early) Matched once:
abcdbcdcdd
M
and (finally) Matched one more time, Capturing the rest:
abcdbcdcdd
M CC
Regex
/(\w) # match 1 character
(?:
(?=(?:.*?\1){≪N≫}) # [1] followed by other ≪N≫ occurrences
| # OR
(?= # [2] followed by:
(?:(?!\1).)*(\1) # 2nd occurence <captured>
(?:(?!\1).)*(\1) # 3rd occurence <captured>
≪repeat previous≫ # repeat subpattern (n-1) times
# *exactly (n-1) times*
(?!.*?\1) # not followed by another occurence
)
)/xg
For n =
/(\w)(?:(?=(?:.*?\1){2})|(?=(?:(?!\1).)*(\1)(?!.*?\1)))/g
demo
/(\w)(?:(?=(?:.*?\1){3})|(?=(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?!.*?\1)))/g
demo
/(\w)(?:(?=(?:.*?\1){4})|(?=(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?!.*?\1)))/g
demo
... etc.
Pseudocode to generate the pattern
// Variables: N (int)
character = "(\w)"
early_match = "(?=(?:.*?\1){" + N + "})"
final_match = "(?="
for i = 1; i < N; i++
final_match += "(?:(?!\1).)*(\1)"
final_match += "(?!.*?\1))"
pattern = character + "(?:" + early_match + "|" + final_match + ")"
JavaScript Code
I'll show an implementation using javascript because we can check the result here (and if it works in javascript, it works in any perl-compatible regex flavour, including .net, java, python, ruby, perl, and all languages that implemented pcre, among others).
var str = 'abcdbcdcdd';
var pattern, re, match, N, i;
var output = "";
// We'll show the results for N = 2, 3 and 4
for (N = 2; N <= 4; N++) {
// Generate pattern
pattern = "(\\w)(?:(?=(?:.*?\\1){" + N + "})|(?=";
for (i = 1; i < N; i++) {
pattern += "(?:(?!\\1).)*(\\1)";
}
pattern += "(?!.*?\\1)))";
re = new RegExp(pattern, "g");
output += "<h3>N = " + N + "</h3><pre>Pattern: " + pattern + "\nText: " + str;
// Loop all matches
while ((match = re.exec(str)) !== null) {
output += "\nPos: " + match.index + "\tMatch:";
// Loop all captures
x = 1;
while (match[x] != null) {
output += " " + match[x];
x++;
}
}
output += "</pre>";
}
document.write(output);
Python3 code
As requested by the OP, I'm linking to a Python3 implementation in ideone.com
Regular expressions (and finite automata) are not able to count to arbitrary integers. They can only count to a predefined integer and fortunately this is your case.
Solving this problem is much easier if we first construct a nondeterministic finite automata (NFA) ad then convert it to regular expression.
So the following automata for n=2 and input alphabet = {a,b,c,d}
will match any string that has exactly 2 repetitions of any char. If no character has 2 repetitions (all chars appear less or more that two times) the string will not match.
Converting it to regex should look like
"^([^a]*a[^a]*a[^a]*)|([^b]*b[^b]*b[^b]*)|([^b]*c[^c]*c[^C]*)|([^d]*d[^d]*d[^d]*)$"
This can get problematic if the input alphabet is big, so that regex should be shortened somehow, but I can't think of it right now.
With .NET regular expressions you can do following:
(\w)(?<=(?=(?:.*\1){n})^.*) where n is variable
Where:
(\w) — any character, captured in first group.
(?<=^.*) — lookbehind assertion, which return us to the start of the string.
(?=(?:.*\1){n}) — lookahead assertion, to see if string have n instances of that character.
Demo
I would not use regular expressions for this. I would use a scripting language such as python. Try out this python function:
alpha = 'abcdefghijklmnopqrstuvwxyz'
def get_matched_chars(n, s):
s = s.lower()
return [char for char in alpha if s.count(char) == n]
The function will return a list of characters, all of which appear in the string s exactly n times. Keep in mind that I only included letters in my alphabet. You can change alpha to represent anything that you want to get matched.

regex with all components optionals, how to avoid empty matches

I have to process a comma separated string which contains triplets of values and translate them to runtime types,the input looks like:
"1x2y3z,80r160g255b,48h30m50s,1x3z,255b,1h,..."
So each substring should be transformed this way:
"1x2y3z" should become Vector3 with x = 1, y = 2, z = 3
"80r160g255b" should become Color with r = 80, g = 160, b = 255
"48h30m50s" should become Time with h = 48, m = 30, s = 50
The problem I'm facing is that all the components are optional (but they preserve order) so the following strings are also valid Vector3, Color and Time values:
"1x3z" Vector3 x = 1, y = 0, z = 3
"255b" Color r = 0, g = 0, b = 255
"1h" Time h = 1, m = 0, s = 0
What I have tried so far?
All components optional
((?:\d+A)?(?:\d+B)?(?:\d+C)?)
The A, B and C are replaced with the correct letter for each case, the expression works almost well but it gives twice the expected results (one match for the string and another match for an empty string just after the first match), for example:
"1h1m1s" two matches [1]: "1h1m1s" [2]: ""
"11x50z" two matches [1]: "11x50z" [2]: ""
"11111h" two matches [1]: "11111h" [2]: ""
This isn't unexpected... after all an empty string matches the expression when ALL of the components are empty; so in order to fix this issue I've tried the following:
1 to 3 quantifier
((?:\d+[ABC]){1,3})
But now, the expression matches strings with wrong ordering or even repeated components!:
"1s1m1h" one match, should not match at all! (wrong order)
"11z50z" one match, should not match at all! (repeated components)
"1r1r1b" one match, should not match at all! (repeated components)
As for my last attempt, I've tried this variant of my first expression:
Match from begin ^ to the end $
^((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
And it works better than the first version but it still matches the empty string plus I should first tokenize the input and then pass each token to the expression in order to assure that the test string could match the begin (^) and end ($) operators.
EDIT: Lookahead attempt (thanks to Casimir et Hippolyte)
After reading and (try to) understanding the regex lookahead concept and with the help of Casimir et Hippolyte answer I've tried the suggested expression:
\b(?=[^,])(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Against the following test string:
"48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h"
And the results were amazing! it is able to detect complete valid matches flawlessly (other expressions gave me 3 matches on "1s1m1h" or "1h1h1h" which weren't intended to be matched at all). Unfortunately it captures emtpy matches everytime a unvalid match is found so a "" is detected just before "1s1m1h", "1h1h1h", "adfank" and "12322134445688", so I modified the Lookahead condition to get the expression below:
\b(?=(?:\d+[ABC]){1,3})(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
It gets rid of the empty matches in any string which doesn't match (?:\d+[ABC]){1,3}) so the empty matches just before "adfank" and "12322134445688" are gone but the ones just before "1s1m1h", "1h1h1h" are stil detected.
So the question is: Is there any regular expression which matches three triplet values in a given order where all component is optional but should be composed of at least one component and doesn't match empty strings?
The regex tool I'm using is the C++11 one.
Yes, you can add a lookahead at the begining to ensure there is at least one character:
^(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
If you need to find this kind of substring in a larger string (so without to tokenize before), you can remove the anchors and use a more explicit subpattern in a lookahead:
(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)
In this case, to avoid false positive (since you are looking for very small strings that can be a part of something else), you can add word-boundaries to the pattern:
\b(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Note: in a comma delimited string: (?=\d+[ABC]) can be replaced by (?=[^,])
I think this might do the trick.
I am keying on either the beginning of the string to match ^ or the comma separator , for fix the start of each match: (?:^|,).
Example:
#include <regex>
#include <iostream>
const std::regex r(R"~((?:^|,)((?:\d+[xrh])?(?:\d+[ygm])?(?:\d+[zbs])?))~");
int main()
{
std::string test = "1x2y3z,80r160g255b,48h30m50s,1x3z,255b";
std::sregex_iterator iter(test.begin(), test.end(), r);
std::sregex_iterator end_iter;
for(; iter != end_iter; ++iter)
std::cout << iter->str(1) << '\n';
}
Output:
1x2y3z
80r160g255b
48h30m50s
1x3z
255b
Is that what you are after?
EDIT:
If you really want to go to town and make empty expressions unmatched then as far as I can tell you have to put in every permutation like this:
const std::string A = "(?:\\d+[xrh])";
const std::string B = "(?:\\d+[ygm])";
const std::string C = "(?:\\d+[zbs])";
const std::regex r("(?:^|,)(" + A + B + C + "|" + A + B + "|" + A + C + "|" + B + C + "|" + A + "|" + B + "|" + C + ")");

How to use regular expressions to extract 3-tuple values from a string

I am trying to extract n 3-tuples (Si, Pi, Vi) from a string.
The string contains at least one such 3-tuple.
Pi and Vi are not mandatory.
SomeTextxyz#S1((property(P1)val(V1))#S2((property(P2)val(V2))#S3
|----------1-------------|----------2-------------|-- n
The desired output would be:
Si,Pi,Vi.
So for n occurrences in the string the output should look like this:
[S1,P1,V1] [S2,P2,V2] ... [Sn-1,Pn-1,Vn-1] (without the brackets)
Example
The input string could be something like this:
MyCarGarage#Mustang((property(PS)val(500))#Porsche((property(PS)val(425‌​)).
Once processed the output should be:
Mustang,PS,500 Porsche,PS,425
Is there an efficient way to extract those 3-tuples using a regular expression
(e.g. using C++ and std::regex) and what would it look like?
#(.*?)\(\(property\((.*?)\)val\((.*?)\)\) should do the trick.
example at http://regex101.com/r/bD1rY2
# # Matches the # symbol
(.*?) # Captures everything until it encounters the next part (ungreedy wildcard)
\(\(property\( # Matches the string "((property(" the backslashes escape the parenthesis
(.*?) # Same as the one above
\)val\( # Matches the string ")val("
(.*?) # Same as the one above
\)\) # Matches the string "))"
How you should implement this in C++ i don't know but that is the easy part :)
http://ideone.com/S7UQpA
I used C's <regex.h> instead of std::regex because std::regex isn't implemented in g++ (which is what IDEONE uses). The regular expression I used:
" In C(++)? regexes are strings.
# Literal match
([^(#]+) As many non-#, non-( characters as possible. This is group 1
( Start another group (group 2)
\\(\\(property\\( Yet more literal matching
([^)]+) As many non-) characters as possible. Group 3.
\\)val\\( Literal again
([^)]+) As many non-) characters as possible. Group 4.
\\)\\) Literal parentheses
) Close group 2
? Group 2 optional
" Close Regex
And some c++:
int getMatches(char* haystack, item** items){
first, calculate the length of the string (we'll use that later) and the number of # found in the string (the maximum number of matches)
int l = -1, ats = 0;
while (haystack[++l])
if (haystack[l] == '#')
ats++;
malloc a large enough array.
*items = (item*) malloc(ats * sizeof(item));
item* arr = *items;
Make a regex needle to find. REGEX is #defined elsewhere.
regex_t needle;
regcomp(&needle, REGEX, REG_ICASE|REG_EXTENDED);
regmatch_t match[5];
ret will hold the return value (0 for "found a match", but there are other errors you may want to be catching here). x will be used to count the found matches.
int ret;
int x = -1;
Loop over matches (ret will be zero if a match is found).
while (!(ret = regexec(&needle, haystack, 5, match,0))){
++x;
Get the name from match1
int bufsize = match[1].rm_eo-match[1].rm_so + 1;
arr[x].name = (char *) malloc(bufsize);
strncpy(arr[x].name, &(haystack[match[1].rm_so]), bufsize - 1);
arr[x].name[bufsize-1]=0x0;
Check to make sure the property (match[3]) and the value (match[4]) were found.
if (!(match[3].rm_so > l || match[3].rm_so<0 || match[3].rm_eo > l || match[3].rm_so< 0
|| match[4].rm_so > l || match[4].rm_so<0 || match[4].rm_eo > l || match[4].rm_so< 0)){
Get the property from match[3].
bufsize = match[3].rm_eo-match[3].rm_so + 1;
arr[x].property = (char *) malloc(bufsize);
strncpy(arr[x].property, &(haystack[match[3].rm_so]), bufsize - 1);
arr[x].property[bufsize-1]=0x0;
Get the value from match[4].
bufsize = match[4].rm_eo-match[4].rm_so + 1;
arr[x].value = (char *) malloc(bufsize);\
strncpy(arr[x].value, &(haystack[match[4].rm_so]), bufsize - 1);
arr[x].value[bufsize-1]=0x0;
} else {
Otherwise, set both property and value to NULL.
arr[x].property = NULL;
arr[x].value = NULL;
}
Move the haystack to past the match and decrement the known length.
haystack = &(haystack[match[0].rm_eo]);
l -= match[0].rm_eo;
}
Return the number of matches.
return x+1;
}
Hope this helps. Though it occurs to me now that you never answered kind of a vital question: What have you tried?

use regular expression to find and replace but only every 3 characters for DNA sequence

Is it possible to do a find/replace using regular expressions on a string of dna such that it only considers every 3 characters (a codon of dna) at a time.
for example I would like the regular expression to see this:
dna="AAACCCTTTGGG"
as this:
AAA CCC TTT GGG
If I use the regular expressions right now and the expression was
Regex.Replace(dna,"ACC","AAA") it would find a match, but in this case of looking at 3 characters at a time there would be no match.
Is this possible?
Why use a regex? Try this instead, which is probably more efficient to boot:
public string DnaReplaceCodon(string input, string match, string replace) {
if (match.Length != 3 || replace.Length != 3)
throw new ArgumentOutOfRangeException();
var output = new StringBuilder(input.Length);
int i = 0;
while (i + 2 < input.Length) {
if (input[i] == match[0] && input[i+1] == match[1] && input[i+2] == match[2]) {
output.Append(replace);
} else {
output.Append(input[i]);
output.Append(input[i]+1);
output.Append(input[i]+2);
}
i += 3;
}
// pick up trailing letters.
while (i < input.Length) output.Append(input[i]);
return output.ToString();
}
Solution
It is possible to do this with regex. Assuming the input is valid (contains only A, T, G, C):
Regex.Replace(input, #"\G((?:.{3})*?)" + codon, "$1" + replacement);
DEMO
If the input is not guaranteed to be valid, you can just do a check with the regex ^[ATCG]*$ (allow non-multiple of 3) or ^([ATCG]{3})*$ (sequence must be multiple of 3). It doesn't make sense to operate on invalid input anyway.
Explanation
The construction above works for any codon. For the sake of explanation, let the codon be AAA. The regex will be \G((?:.{3})*?)AAA.
The whole regex actually matches the shortest substring that ends with the codon to be replaced.
\G # Must be at beginning of the string, or where last match left off
((?:.{3})*?) # Match any number of codon, lazily. The text is also captured.
AAA # The codon we want to replace
We make sure the matches only starts from positions whose index is multiple of 3 with:
\G which asserts that the match starts from where the previous match left off (or the beginning of the string)
And the fact that the pattern ((?:.{3})*?)AAA can only match a sequence whose length is multiple of 3.
Due to the lazy quantifier, we can be sure that in each match, the part before the codon to be replaced (matched by ((?:.{3})*?) part) does not contain the codon.
In the replacement, we put back the part before the codon (which is captured in capturing group 1 and can be referred to with $1), follows by the replacement codon.
NOTE
As explained in the comment, the following is not a good solution! I leave it in so that others will not fall for the same mistake
You can usually find out where a match starts and ends via m.start() and m.end(). If m.start() % 3 == 0 you found a relevant match.

Regex to calculate straight poker hand - Using ASCII CODE

In another question I learned how to calculate straight poker hand using regex (here).
Now, by curiosity, the question is: can I use regex to calculate the same thing, using ASCII CODE?
Something like:
regex: [C][C+1][C+2][C+3][C+4], being C the ASCII CODE (or like this)
Matches: 45678, 23456
Doesn't matches: 45679 or 23459 (not in sequence)
Your main problem is really going to be that you're not using ASCII-consecutive encodings for your hands, you're using numerics for non-face cards, and non-consecutive, non-ordered characters for face cards.
You need to detect, at the start of the strings, 2345A, 23456, 34567, ..., 6789T, 789TJ, 89TJQ, 9TJQK and TJQKA.
These are not consecutive ASCII codes and, even if they were, you would run into problems since both A2345 and TJQKA are valid and you won't get A being both less than and greater than the other characters in the same character set.
If it has to be done by a regex, then the following regex segment:
(2345A|23456|34567|45678|56789|6789T|789TJ|89TJQ|9TJQK|TJQKA)
is probably the easiest and most readable one you'll get.
There is no regex that will do what you want as the other answers have pointed out, but you did say that you want to learn regex, so here's another meta-regex approach that may be instructional.
Here's a Java snippet that, given a string, programmatically generate the pattern that will match any substring of that string of length 5.
String seq = "ABCDEFGHIJKLMNOP";
System.out.printf("^(%s)$",
seq.replaceAll(
"(?=(.{5}).).",
"$1|"
)
);
The output is (as seen on ideone.com):
^(ABCDE|BCDEF|CDEFG|DEFGH|EFGHI|FGHIJ|GHIJK|HIJKL|IJKLM|JKLMN|KLMNO|LMNOP)$
You can use this to conveniently generate the regex pattern to match straight poker hands, by initializing seq as appropriate.
How it works
. metacharacter matches "any" character (line separators may be an exception depending on the mode we're in).
The {5} is an exact repetition specifier. .{5} matches exactly 5 ..
(?=…) is positive lookahead; it asserts that a given pattern can be matched, but since it's only an assertion, it doesn't actually make (i.e. consume) the match from the input string.
Simply (…) is a capturing group. It creates a backreference that you can use perhaps later in the pattern, or in substitutions, or however you see fit.
The pattern is repeated here for convenience:
match one char
at a time
|
(?=(.{5}).).
\_________/
must be able to see 6 chars ahead
(capture the first 5)
The pattern works by matching one character . at a time. Before that character is matched, however, we assert (?=…) that we can see a total of 6 characters ahead (.{5})., capturing (…) into group 1 the first .{5}. For every such match, we replace with $1|, that is, whatever was captured by group 1, followed by the alternation metacharacter.
Let's consider what happens when we apply this to a shorter String seq = "ABCDEFG";. The ↑ denotes our current position.
=== INPUT === === OUTPUT ===
A B C D E F G ABCDE|BCDEFG
↑
We can assert (?=(.{5}).), matching ABCDEF
in the lookahead. ABCDE is captured.
We now match A, and replace with ABCDE|
A B C D E F G ABCDE|BCDEF|CDEFG
↑
We can assert (?=(.{5}).), matching BCDEFG
in the lookahead. BCDEF is captured.
We now match B, and replace with BCDEF|
A B C D E F G ABCDE|BCDEF|CDEFG
↑
Can't assert (?=(.{5}).), skip forward
A B C D E F G ABCDE|BCDEF|CDEFG
↑
Can't assert (?=(.{5}).), skip forward
A B C D E F G ABCDE|BCDEF|CDEFG
↑
Can't assert (?=(.{5}).), skip forward
:
:
A B C D E F G ABCDE|BCDEF|CDEFG
↑
Can't assert (?=(.{5}).), and we are at
the end of the string, so we're done.
So we get ABCDE|BCDEF|CDEFG, which are all the substrings of length 5 of seq.
References
regular-expressions.info/Dot, Repetition, Grouping, Lookaround
Something like regex: [C][C+1][C+2][C+3][C+4], being C the ASCII CODE (or like this)
You can not do anything remotely close to this in most regex flavors. This is simply not the kinds of patterns that regex is designed for.
There is no mainstream regex pattern that will succintly match any two consecutive characters that differ by x in their ASCII encoding.
For instructional purposes...
Here you go (see also on ideone.com):
String alpha = "ABCDEFGHIJKLMN";
String p = alpha.replaceAll(".(?=(.))", "$0(?=$1|\\$)|") + "$";
System.out.println(p);
// A(?=B|$)|B(?=C|$)|C(?=D|$)|D(?=E|$)|E(?=F|$)|F(?=G|$)|G(?=H|$)|
// H(?=I|$)|I(?=J|$)|J(?=K|$)|K(?=L|$)|L(?=M|$)|M(?=N|$)|N$
String p5 = String.format("(?:%s){5}", p);
String[] tests = {
"ABCDE", // true
"JKLMN", // true
"AAAAA", // false
"ABCDEFGH", // false
"ABCD", // false
"ACEGI", // false
"FGHIJ", // true
};
for (String test : tests) {
System.out.printf("[%s] : %s%n",
test,
test.matches(p5)
);
}
This uses meta-regexing technique to generate a pattern. That pattern ensures that each character is followed by the right character (or the end of the string), using lookahead. That pattern is then meta-regexed to be matched repeatedly 5 times.
You can substitute alpha with your poker sequence as necessary.
Note that this is an ABSOLUTELY IMPRACTICAL solution. It's much more readable to e.g. just check if alpha.contains(test) && (test.length() == 5).
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
SOLVED!
See in http://jsfiddle.net/g48K9/3
I solved using closure, in js.
String.prototype.isSequence = function () {
If (this == "A2345") return true; // an exception
return this.replace(/(\w)(\w)(\w)(\w)(\w)/, function (a, g1, g2, g3, g4, g5) {
return code(g1) == code(g2) -1 &&
code(g2) == code(g3) -1 &&
code(g3) == code(g4) -1 &&
code(g4) == code(g5) -1;
})
};
function code(card){
switch(card){
case "T": return 58;
case "J": return 59;
case "Q": return 60;
case "K": return 61;
case "A": return 62;
default: return card.charCodeAt();
}
}
test("23456");
test("23444");
test("789TJ");
test("TJQKA");
test("8JQKA");
function test(cards) {
alert("cards " + cards + ": " + cards.isSequence())
}
Just to clarify, ascii codes:
ASCII CODES:
2 = 50
3 = 51
4 = 52
5 = 53
6 = 54
7 = 55
8 = 56
9 = 57
T = 84 -> 58
J = 74 -> 59
Q = 81 -> 60
K = 75 -> 61
A = 65 -> 62