Couchbase xdcr regex - How do I exclude keys using regex? - regex

I am trying to exclude certain documents from being transported to ES using XDCR.
I have the following regex that filters ABCD and IJ
https://regex101.com/r/gI6sN8/11
Now, I want to use this regex in the XDCR filtering
^(?!.(ABCD|IJ)).$
How do I exclude keys using regex?
EDIT:
What if I want to select everything that doesn't contains ABCDE and ABCHIJ.
I tried
https://regex101.com/r/zT7dI4/1

edit:
Sorry, after further looking at it, this method is invalid. For instance, [^B] allows an A to get by, letting AABCD slip through (since it will match AA at first, then match BCD with the [^A]. Please disregard this post.
Demo here shows below method is invalid
(disregard this)
You could use a posix style trick to exclude words.
Below is to exclude ABCD and IJ.
You get a sense of the pattern from this.
Basically, you put all the first letters into a negative class
as the first in the alternation list, then handle each word
in a separate alternation.
^(?:[^AI]+|(?:A(?:[^B]|$)|AB(?:[^C]|$)|ABC(?:[^D]|$))|(?:I(?:[^J]|$)))+$
Demo
Expanded
^
(?:
[^AI]+
|
(?: # Handle 'ABCD`
A
(?: [^B] | $ )
| AB
(?: [^C] | $ )
| ABC
(?: [^D] | $ )
)
|
(?: # Handle 'IJ`
I
(?: [^J] | $ )
)
)+
$

Hopefully one day there will be built-in support for inverting the match expression. In the mean time, here's a Java 8 program that generates regular expressions for inverted prefix matching using basic regex features supported by the Couchbase XDCR filter.
This should work as long as your key prefixes are somehow delimited from the remainder of the key. Make sure to include the delimiter in the input when modifying this code.
Sample output for red:, reef:, green: is:
^([^rg]|r[^e]|g[^r]|re[^de]|gr[^e]|red[^:]|ree[^f]|gre[^e]|reef[^:]|gree[^n]|green[^:])
File: NegativeLookaheadCheater.java
import java.util.*;
import java.util.stream.Collectors;
public class NegativeLookaheadCheater {
public static void main(String[] args) {
List<String> input = Arrays.asList("red:", "reef:", "green:");
System.out.println("^" + invertMatch(input));
}
private static String invertMatch(Collection<String> literals) {
int maxLength = literals.stream().mapToInt(String::length).max().orElse(0);
List<String> terms = new ArrayList<>();
for (int i = 0; i < maxLength; i++) {
terms.addAll(terms(literals, i));
}
return "(" + String.join("|", terms) + ")";
}
private static List<String> terms(Collection<String> words, int index) {
List<String> result = new ArrayList<>();
Map<String, Set<Character>> prefixToNextLetter = new LinkedHashMap<>();
for (String word : words) {
if (word.length() > index) {
String prefix = word.substring(0, index);
prefixToNextLetter.computeIfAbsent(prefix, key -> new LinkedHashSet<>()).add(word.charAt(index));
}
}
prefixToNextLetter.forEach((literalPrefix, charsToNegate) -> {
result.add(literalPrefix + "[^" + join(charsToNegate) + "]");
});
return result;
}
private static String join(Collection<Character> collection) {
return collection.stream().map(c -> Character.toString(c)).collect(Collectors.joining());
}
}

Related

Is it possible to combine two regex patterns into one without using OR | , pipe character

I am working on regex to match either(not both) of two conditions like this:
Digits following A. ex) A123, regex (?<=A)\d+
Digits followed by Z. ex) 123Z, regex \d+(?=Z)
So naive way is to use '|' to combine them like this:
(?<=A)\d+|\d+(?=Z)
This could be better for readability and maintainability, but I'm just curious if there is another way to make it without |.
Here is a C# test code for that.
string pattern = #"(?<=A)\d+|\d+(?=Z)";
var currencies = new string[]{
"A123", // match
"123Z", // match
"123", // NOT match
};
foreach (var c in currencies)
{
Match m = Regex.Match(c, pattern, RegexOptions.IgnoreCase);
Console.WriteLine("Matched:" + m.Success + ". Value:" + m.Value);
}
The output of the code here.
Matched:True. Value:123
Matched:True. Value:123
Matched:False. Value:

Regular Expression: match and count "A" and stop after found "B" by counted times [duplicate]

I need a regular expression to select all the text between two outer brackets.
Example:
START_TEXT(text here(possible text)text(possible text(more text)))END_TXT
^ ^
Result:
(text here(possible text)text(possible text(more text)))
I want to add this answer for quickreference. Feel free to update.
.NET Regex using balancing groups:
\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)
Where c is used as the depth counter.
Demo at Regexstorm.com
Stack Overflow: Using RegEx to balance match parenthesis
Wes' Puzzling Blog: Matching Balanced Constructs with .NET Regular Expressions
Greg Reinacker's Weblog: Nested Constructs in Regular Expressions
PCRE using a recursive pattern:
\((?:[^)(]+|(?R))*+\)
Demo at regex101; Or without alternation:
\((?:[^)(]*(?R)?)*+\)
Demo at regex101; Or unrolled for performance:
\([^)(]*+(?:(?R)[^)(]*)*+\)
Demo at regex101; The pattern is pasted at (?R) which represents (?0).
Perl, PHP, Notepad++, R: perl=TRUE, Python: PyPI regex module with (?V1) for Perl behaviour.
(the new version of PyPI regex package already defaults to this → DEFAULT_VERSION = VERSION1)
Ruby using subexpression calls:
With Ruby 2.0 \g<0> can be used to call full pattern.
\((?>[^)(]+|\g<0>)*\)
Demo at Rubular; Ruby 1.9 only supports capturing group recursion:
(\((?>[^)(]+|\g<1>)*\))
Demo at Rubular  (atomic grouping since Ruby 1.9.3)
JavaScript  API :: XRegExp.matchRecursive
XRegExp.matchRecursive(str, '\\(', '\\)', 'g');
Java: An interesting idea using forward references by #jaytea.
Without recursion up to 3 levels of nesting:
(JS, Java and other regex flavors)
To prevent runaway if unbalanced, with * on innermost [)(] only.
\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)
Demo at regex101; Or unrolled for better performance (preferred).
\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)
Demo at regex101; Deeper nesting needs to be added as required.
Reference - What does this regex mean?
RexEgg.com - Recursive Regular Expressions
Regular-Expressions.info - Regular Expression Recursion
Mastering Regular Expressions - Jeffrey E.F. Friedl 1 2 3 4
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
You can use regex recursion:
\(([^()]|(?R))*\)
[^\(]*(\(.*\))[^\)]*
[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.
This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.
Regular expressions can not do this.
Regular expressions are based on a computing model known as Finite State Automata (FSA). As the name indicates, a FSA can remember only the current state, it has no information about the previous states.
In the above diagram, S1 and S2 are two states where S1 is the starting and final step. So if we try with the string 0110 , the transition goes as follows:
0 1 1 0
-> S1 -> S2 -> S2 -> S2 ->S1
In the above steps, when we are at second S2 i.e. after parsing 01 of 0110, the FSA has no information about the previous 0 in 01 as it can only remember the current state and the next input symbol.
In the above problem, we need to know the no of opening parenthesis; this means it has to be stored at some place. But since FSAs can not do that, a regular expression can not be written.
However, an algorithm can be written to do this task. Algorithms are generally falls under Pushdown Automata (PDA). PDA is one level above of FSA. PDA has an additional stack to store some additional information. PDAs can be used to solve the above problem, because we can 'push' the opening parenthesis in the stack and 'pop' them once we encounter a closing parenthesis. If at the end, stack is empty, then opening parenthesis and closing parenthesis matches. Otherwise not.
(?<=\().*(?=\))
If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).
This regex just returns the text between the first opening and the last closing parentheses in your string.
(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not a commonly available.
It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.
You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.
Angle brackets <> were used because they do not require escaping.
The regular expression looks like this:
<
[^<>]*
(
(
(?<Open><)
[^<>]*
)+
(
(?<Close-Open>>)
[^<>]*
)+
)*
(?(Open)(?!))
>
I was also stuck in this situation when dealing with nested patterns and regular-expressions is the right tool to solve such problems.
/(\((?>[^()]+|(?1))*\))/
This is the definitive regex:
\(
(?<arguments>
(
([^\(\)']*) |
(\([^\(\)']*\)) |
'(.*?)'
)*
)
\)
Example:
input: ( arg1, arg2, arg3, (arg4), '(pip' )
output: arg1, arg2, arg3, (arg4), '(pip'
note that the '(pip' is correctly managed as string.
(tried in regulator: http://sourceforge.net/projects/regulator/)
I have written a little JavaScript library called balanced to help with this task. You can accomplish this by doing
balanced.matches({
source: source,
open: '(',
close: ')'
});
You can even do replacements:
balanced.replacements({
source: source,
open: '(',
close: ')',
replace: function (source, head, tail) {
return head + source + tail;
}
});
Here's a more complex and interactive example JSFiddle.
Adding to bobble bubble's answer, there are other regex flavors where recursive constructs are supported.
Lua
Use %b() (%b{} / %b[] for curly braces / square brackets):
for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)
Raku (former Perl6):
Non-overlapping multiple balanced parentheses matches:
my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (「(a(b)c)」 「((d)f(g))」)
Overlapping multiple balanced parentheses matches:
say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (「(a(b)c)」 「(b)」 「((d)f(g))」 「(d)」 「(g)」)
See demo.
Python re non-regex solution
See poke's answer for How to get an expression between balanced parentheses.
Java customizable non-regex solution
Here is a customizable solution allowing single character literal delimiters in Java:
public static List<String> getBalancedSubstrings(String s, Character markStart,
Character markEnd, Boolean includeMarkers)
{
List<String> subTreeList = new ArrayList<String>();
int level = 0;
int lastOpenDelimiter = -1;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == markStart) {
level++;
if (level == 1) {
lastOpenDelimiter = (includeMarkers ? i : i + 1);
}
}
else if (c == markEnd) {
if (level == 1) {
subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
}
if (level > 0) level--;
}
}
return subTreeList;
}
}
Sample usage:
String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]
The regular expression using Ruby (version 1.9.3 or above):
/(?<match>\((?:\g<match>|[^()]++)*\))/
Demo on rubular
The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.
If you need to match matching nested brackets, then you need something more than regular expressions. - see #dehmann
If it's just first open to last close see #Zach
Decide what you want to happen with:
abc ( 123 ( foobar ) def ) xyz ) ghij
You need to decide what your code needs to match in this case.
"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.
This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns. This is where the re package greatly
assists in parsing.
"""
import re
# The pattern below recognises a sequence consisting of:
# 1. Any characters not in the set of open/close strings.
# 2. One of the open/close strings.
# 3. The remainder of the string.
#
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included. However quotes are not ignored inside
# quotes. More logic is needed for that....
pat = re.compile("""
( .*? )
( \( | \) | \[ | \] | \{ | \} | \< | \> |
\' | \" | BEGIN | END | $ )
( .* )
""", re.X)
# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.
matching = { "(" : ")",
"[" : "]",
"{" : "}",
"<" : ">",
'"' : '"',
"'" : "'",
"BEGIN" : "END" }
# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.
def matchnested(s, term=""):
lst = []
while True:
m = pat.match(s)
if m.group(1) != "":
lst.append(m.group(1))
if m.group(2) == term:
return lst, m.group(3)
if m.group(2) in matching:
item, s = matchnested(m.group(3), matching[m.group(2)])
lst.append(m.group(2))
lst.append(item)
lst.append(matching[m.group(2)])
else:
raise ValueError("After <<%s %s>> expected %s not %s" %
(lst, s, term, m.group(2)))
# Unit test.
if __name__ == "__main__":
for s in ("simple string",
""" "double quote" """,
""" 'single quote' """,
"one'two'three'four'five'six'seven",
"one(two(three(four)five)six)seven",
"one(two(three)four)five(six(seven)eight)nine",
"one(two)three[four]five{six}seven<eight>nine",
"one(two[three{four<five>six}seven]eight)nine",
"oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
"ERROR testing ((( mismatched ))] parens"):
print "\ninput", s
try:
lst, s = matchnested(s)
print "output", lst
except ValueError as e:
print str(e)
print "done"
You need the first and last parentheses. Use something like this:
str.indexOf('('); - it will give you first occurrence
str.lastIndexOf(')'); - last one
So you need a string between,
String searchedString = str.substring(str1.indexOf('('),str1.lastIndexOf(')');
because js regex doesn't support recursive match, i can't make balanced parentheses matching work.
so this is a simple javascript for loop version that make "method(arg)" string into array
push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)
const parser = str => {
let ops = []
let method, arg
let isMethod = true
let open = []
for (const char of str) {
// skip whitespace
if (char === ' ') continue
// append method or arg string
if (char !== '(' && char !== ')') {
if (isMethod) {
(method ? (method += char) : (method = char))
} else {
(arg ? (arg += char) : (arg = char))
}
}
if (char === '(') {
// nested parenthesis should be a part of arg
if (!isMethod) arg += char
isMethod = false
open.push(char)
} else if (char === ')') {
open.pop()
// check end of arg
if (open.length < 1) {
isMethod = true
ops.push({ method, arg })
method = arg = undefined
} else {
arg += char
}
}
}
return ops
}
// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)
console.log(test)
the result is like
[ { method: 'push', arg: 'number' },
{ method: 'map', arg: 'test(a(a()))' },
{ method: 'bass', arg: 'wow,abc' } ]
[ { method: '$$', arg: 'groups' },
{ method: 'filter',
arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
{ method: 'pickBy', arg: '_id,type' },
{ method: 'map', arg: 'test()' },
{ method: 'as', arg: 'groups' } ]
While so many answers mention this in some form by saying that regex does not support recursive matching and so on, the primary reason for this lies in the roots of the Theory of Computation.
Language of the form {a^nb^n | n>=0} is not regular. Regex can only match things that form part of the regular set of languages.
Read more # here
I didn't use regex since it is difficult to deal with nested code. So this snippet should be able to allow you to grab sections of code with balanced brackets:
def extract_code(data):
""" returns an array of code snippets from a string (data)"""
start_pos = None
end_pos = None
count_open = 0
count_close = 0
code_snippets = []
for i,v in enumerate(data):
if v =='{':
count_open+=1
if not start_pos:
start_pos= i
if v=='}':
count_close +=1
if count_open == count_close and not end_pos:
end_pos = i+1
if start_pos and end_pos:
code_snippets.append((start_pos,end_pos))
start_pos = None
end_pos = None
return code_snippets
I used this to extract code snippets from a text file.
This do not fully address the OP question but I though it may be useful to some coming here to search for nested structure regexp:
Parse parmeters from function string (with nested structures) in javascript
Match structures like:
matches brackets, square brackets, parentheses, single and double quotes
Here you can see generated regexp in action
/**
* get param content of function string.
* only params string should be provided without parentheses
* WORK even if some/all params are not set
* #return [param1, param2, param3]
*/
exports.getParamsSAFE = (str, nbParams = 3) => {
const nextParamReg = /^\s*((?:(?:['"([{](?:[^'"()[\]{}]*?|['"([{](?:[^'"()[\]{}]*?|['"([{][^'"()[\]{}]*?['")}\]])*?['")}\]])*?['")}\]])|[^,])*?)\s*(?:,|$)/;
const params = [];
while (str.length) { // this is to avoid a BIG performance issue in javascript regexp engine
str = str.replace(nextParamReg, (full, p1) => {
params.push(p1);
return '';
});
}
return params;
};
This might help to match balanced parenthesis.
\s*\w+[(][^+]*[)]\s*
This one also worked
re.findall(r'\(.+\)', s)

find a string pattern in a string array

I need to count the occurrence of specified patterns in the input strands and produces a report for each pattern.
The input string would contain 1 AA AATTCGAA end
the 1 signifies one pattern to search for and AA is the pattern and the next is the part you would search AA in.
My idea is to :
public static void main(String[] args){
Scanner s = new Scanner(System.in);
System.out.println("How many patterns do you want and enter patterns and DNA Sequence(type 'end' to signify end):");
String DNA = s.nextLine();
process(DNA);
}
public static void process(String DNA){
String number = DNA.replaceFirst(".*?(\\d+).*", "$1");
int N = Integer.parseInt(number);
DNA.toUpperCase();
String[] DNAarray;
DNAarray = DNA.split(" ");
for(int i=1; i<N; i++){
int count=0;
for(int j =0; j < DNAarray.length; j++) {
if(DNAarray[i+N].contains(DNAarray[i])){
count= count++;
}
}
System.out.println("Pattern:"+DNAarray[i]+ "Count:"+count);
}
This should do it:
using System;
using System.Text.RegularExpressions;
public class Program
{
public void Main()
{
Console.WriteLine(PatternCount("1 AA AADDRRSSAA"));
}
public int PatternCount(string sDNA) {
Regex reParts = new Regex("(\\d+)\\s(\\w\\w)\\s(\\w+)");
Match m = reParts.Match(sDNA);
if (m.Success)
{
return Regex.Matches(m.Groups[3].Value, m.Groups[2].Value).Count;
}
else
return 0;
}
}
First RE splits the input into count, pattern and data. (Not sure why you want to limit the number of patterns to search for. This code ignores that. Modify after your needs...)
Second RE equals the pattern wanted and "Matches" counts the number of occurrences. Work from here.
Regards
(I feel good today, doing people's work ;))
Really no need to put the number of searches. And, actually this could be done
with a single regex. I can't remember if Dot-Net supports the \G anchor,
but this is really not necessary anyway. I left it in.
Each Match:
Finds a new key.
Captures the keys sub-string matches at the end.
Advances the search position by just the key.
So, sit in a Find loop.
On each match print the 'Key' capture buffer,
then print the capture collection 'Values' count.
Thats all there is to it.
The regex will search for overlapping keys. To change it to exclusive keys,
change the = to : as shown in the comments.
You can also make it a little more specific. For example, change all the \w's to [A-Z], etc...
The regex:
(?:
^ [ \d]*
| \G
)
(?<Key> \w+ ) #_(1)
[ ]+
(?=
(?: \w+ [ ]+ )*
(?= \w )
(?:
(?= # <- Change the = to : to get non-overlapped matches
(?<Values> \1 ) #_(2)
)
| .
)*
$
)
This is a perl test case
# $str = '2 6 AA TT PP AAATTCGAA';
# $count = 0;
#
# while ( $str =~ /(?:^[ \d]*|\G)(\w+)[ ]+(?=(?:\w+[ ]+)*(?=\w)(?:(?=(\1)(?{ $count++ }))|.)*$)/g )
# {
# print "search = '$1'\n";
# print "found = '$count'\n";
# $count = 0;
#
# }
#
# Output >>
#
# search = 'AA'
# found = '3'
# search = 'TT'
# found = '1'
# search = 'PP'
# found = '0'
#
#

Solution required to create Regex pattern

I am developing a windows application in C#. I have been searching for the solution to my problem in creating a Regex pattern. I want to create a Regex pattern matching the either of the following strings:
XD=(111111) XT=( 588.466)m3 YT=( .246)m3 G=( 3.6)V N=(X0000000000) M=(Y0000000000) O=(Z0000000000) Date=(06.01.01)Time=(00:54:55) Q=( .00)m3/hr
XD=(111 ) XT=( 588.466)m3 YT=( .009)m3 G=( 3.6)V N=(X0000000000) M=(Y0000000000) O=(Z0000000000) Date=(06.01.01)Time=(00:54:55) Q=( .00)m3/hr
The specific requirement is that I need all the values from the above given string which is a collection of key/value pairs. Also, would like to know the right approach (in terms of efficiency and performance) out of the two...Regex pattern matching or substring, for the above problem.
Thank you all in advance and let me know, if more details are required.
I don't know C#, so there probably is a better way to build a key/value array. I constructed a regex and handed it to RegexBuddy which generated the following code snippet:
StringCollection keyList = new StringCollection();
StringCollection valueList = new StringCollection();
StringCollection unitList = new StringCollection();
try {
Regex regexObj = new Regex(
#"(?<key>\b\w+) # Match an alphanumeric identifier
\s*=\s* # Match a = (optionally surrounded by whitespace)
\( # Match a (
\s* # Match optional whitespace
(?<value>[^()]+) # Match the value string (anything except parens)
\) # Match a )
(?<unit>[^\s=]+ # Match an optional unit (anything except = or space)
\b # which must end at a word boundary
(?!\s*=) # and not be an identifier (i. e. followed by =)
)? # and is optional, as mentioned.",
RegexOptions.IgnorePatternWhitespace);
Match matchResult = regexObj.Match(subjectString);
while (matchResult.Success) {
keyList.Add(matchResult.Groups["key"].Value);
valueList.Add(matchResult.Groups["value"].Value);
unitList.Add(matchResult.Groups["unit"].Value);
matchResult = matchResult.NextMatch();
}
Regex re=new Regex(#"(\w+)\=\(([\d\s\.]+)\)");
MatchCollection m=re.Matches(s);
m[0].Groups will have { XD=(111111), XD, 111111 }
m[1].Groups will have { XT=( 588.466), XT, 588.466 }
String[] rows = { "XD=(111111) XT=( 588.466)m3 YT=( .246)m3 G=( 3.6)V N=(X0000000000) M=(Y0000000000) O=(Z0000000000) Date=(06.01.01)Time=(00:54:55) Q=( .00)m3/hr",
"XD=(111 ) XT=( 588.466)m3 YT=( .009)m3 G=( 3.6)V N=(X0000000000) M=(Y0000000000) O=(Z0000000000) Date=(06.01.01)Time=(00:54:55) Q=( .00)m3/hr" };
foreach (String s in rows) {
MatchCollection Pair = Regex.Matches(s, #"
(\S+) # Match all non-whitespace before the = and store it in group 1
= # Match the =
(\([^)]+\S+) # Match the part in brackets and following non-whitespace after the = and store it in group 2
", RegexOptions.IgnorePatternWhitespace);
foreach (Match item in Pair) {
Console.WriteLine(item.Groups[1] + " => " + item.Groups[2]);
}
Console.WriteLine();
}
Console.ReadLine();
If you want to extract the units also then use this regex
#"(\S+)=(\([^)]+(\S+))
I added a set of brackets around it, then you will find the unit in item.Groups[3]

Matching token sequences

I have a set of n tokens (e.g., a, b, c) distributed among a bunch of other tokens. I would like to know if all members of my set occur within a given number of positions (window size). It occurred to me that it may be possible to write a RegEx to capture this state, but the exact syntax eludes me.
11111
012345678901234
ab ab bc a cba
In this example, given window size=5, I would like to match cba at positions 12-14, and abc in positions 3-7.
Is there a way to do this with RegEx, or is there some other kind of grammar that I can use to capture this logic?
I am hoping to implement this in Java.
Here's a regex that matches 5-letter sequences that include all of 'a', 'b' and 'c':
(?=.{0,4}a)(?=.{0,4}b)(?=.{0,4}c).{5}
So, while basically matching any 5 characters (with .{5}), there are three preconditions the matches have to observe. Each of them requires one of the tokens/letters to be present (up to 4 characters followed by 'a', etc.). (?=X) matches "X, with a zero-width positive look-ahead", where zero-width means that the character position is not moved while matching.
Doing this with regexes is slow, though.. Here's a more direct version (seems about 15x faster than using regular expressions):
public static void find(String haystack, String tokens, int windowLen) {
char[] tokenChars = tokens.toCharArray();
int hayLen = haystack.length();
int pos = 0;
nextPos:
while (pos + windowLen <= hayLen) {
for (char c : tokenChars) {
int i = haystack.indexOf(c, pos);
if (i < 0) return;
if (i - pos >= windowLen) {
pos = i - windowLen + 1;
continue nextPos;
}
}
// match found at pos
System.out.println(pos + ".." + (pos + windowLen - 1) + ": " + haystack.substring(pos, pos + windowLen));
pos++;
}
}
This tested Java program has a commented regex which does the trick:
import java.util.regex.*;
public class TEST {
public static void main(String[] args) {
String s = "ab ab bc a cba";
Pattern p = Pattern.compile(
"# Match 5 char sequences containing: a and b and c\n" +
"(?=[abc]) # Assert first char is a, b or c.\n" +
"(?=.{0,4}a) # Assert an 'a' within 5 chars.\n" +
"(?=.{0,4}b) # Assert an 'b' within 5 chars.\n" +
"(?=.{0,4}c) # Assert an 'c' within 5 chars.\n" +
".{5} # If so, match the 5 chers.",
Pattern.COMMENTS);
Matcher m = p.matcher(s);
while (m.find()) {
System.out.print("Match = \""+ m.group() +"\"\n");
}
}
}
Note that there is another valid sequence S9:13" a cb" in your test data (before the S12:14"cba". Assuming you did not want to match this one, I added an additional constraint to filter it out, which requires that the 5 char window must begin with an a, b or c.
Here is the output from the script:
Match = "ab bc"
Match = "a cba"
Well, one possibility (albeit a completely impractical one) is simply to match against all permutations:
abc..|ab.c.|ab..c| .... etc.
This can be factorised somewhat:
ab(c..|.c.|..c)|a.(bc.|b.c .... etc.
I'm not sure if you can do better with regex.
Pattern p = Pattern.compile("(?:a()|b()|c()|.){5}\\1\\2\\3");
String s = "ab ab bc a cba";
Matcher m = p.matcher(s);
while (m.find())
{
System.out.println(m.group());
}
output:
ab bc
a cb
This is inspired by Recipe #5.7 in Regular Expressions Cookbook. Each back-reference (\1, \2, \3) acts like a zero-width assertion, indicating that the corresponding capturing group participated in the match, even though the group itself didn't consume any characters.
The authors warn that this trick relies on behavior that's undocumented in most flavors. It works in Java, .NET, Perl, PHP, Python and Ruby (original and Oniguruma), but not in JavaScript or ActionScript.