I was trying to understand one of the famous regex matching DP algorithms.
In case people don't know it, here are the description and the algorithm.
'.' Matches any single character.
'*' Matches zero or more of the preceding element.
The matching should cover the entire input string (not partial).
The function prototype should be:
bool isMatch(const char *s, const char *p)
Some examples:
isMatch("aa","a") → false
isMatch("aa","aa") → true
isMatch("aaa","aa") → false
isMatch("aa", "a*") → true
isMatch("aa", ".*") → true
isMatch("ab", ".*") → true
isMatch("aab", "c*a*b") → true
static boolean isMatch(String s, String p) {
    boolean[][] dp = new boolean[s.length() + 1][p.length() + 1];
    dp[0][0] = true;
    // Empty string against a pattern prefix: only "x*" pairs can match nothing.
    for (int i = 1; i < dp[0].length; i++) {
        if (p.charAt(i - 1) == '*') {
            dp[0][i] = dp[0][i - 2];
        }
    }
    for (int i = 1; i < dp.length; i++) {
        for (int j = 1; j < dp[0].length; j++) {
            char schar = s.charAt(i - 1);
            char pchar = p.charAt(j - 1);
            if (schar == pchar || pchar == '.') {
                dp[i][j] = dp[i - 1][j - 1];
            } else if (pchar == '*') {
                if (schar != p.charAt(j - 2) && p.charAt(j - 2) != '.') {
                    //    - a b *
                    //  - t f f f
                    //  a f t f t   // b != a and b != '.', thus treat b* as 0 matches
                    dp[i][j] = dp[i][j - 2];
                } else {
                    //    - a b *
                    //  - t f f f
                    //  a f t f t
                    //  b f f t t   // dp[i][j - 2] 0 matches or dp[i][j - 1] 1 match or dp[i - 1][j] 2+ matches (not sure why)
                    dp[i][j] = dp[i][j - 2] || dp[i][j - 1] || dp[i - 1][j];
                }
            }
        }
    }
    return dp[s.length()][p.length()];
}
I understand most of it, but this is the part I don't get:
dp[i][j] = dp[i][j - 2] || dp[i][j - 1] || dp[i - 1][j];
People say dp[i - 1][j] covers 2+ matches, but I don't understand why. Can someone please explain why I need to check dp[i - 1][j]?
I'll use a bit more informal notation so bear with me.
Capitals will denote strings (in the pattern those could include the special characters) and lowercase letters, '.' and '*' will stand for themselves.
Let's say we're matching Ax to Bx, that is some string starting with A (which is itself a string, like xyzz) ending in 'x', with a pattern starting with B (which is itself a pattern, for example, x.*) ending in 'x'. The result is the same as that of matching A to B (as we have no other choice but to match x to x).
We could write that as follows:
isMatch(Ax, Bx) = isMatch(A, B)
Similarly, matching Ax to By is impossible.
isMatch(Ax, By) = false
Easy enough. So that would correspond to the first if statement in the two nested loops.
Now let's take the case of an asterisk.
Matching Ax to By* can only be done by ignoring the y* (taking zero y's), that is by matching Ax to B.
isMatch(Ax, By*) = isMatch(Ax, B)
If however the y is replaced by a dot or by x, there are choices.
We'll take the case of Ax and Bx*. The two options are matching Ax to B (means taking zero x's) or matching A to Bx* (means taking an x, but we can still take more so the pattern doesn't change):
isMatch(Ax, Bx*) = isMatch(Ax, B) || isMatch(A, Bx*)
The last one would, in your code, translate to:
dp[i][j] = dp[i][j - 2] || dp[i - 1][j]
So now I'm wondering if your question was really about dp[i][j - 1], as that's what would confuse me.
I might be wrong but that one seems unnecessary.
The meaning of it is to drop the asterisk, that is, changing "zero or more"
to "exactly one", which is already covered by the other two cases, taking the second followed by the first.
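The recurrences above can be written down almost verbatim as a memoized recursion. This is only a sketch of the answer's reasoning, not the asker's code: the names is_match and go are made up for illustration, and it assumes a well-formed pattern in which every '*' has a preceding element.

```python
from functools import lru_cache


def is_match(s: str, p: str) -> bool:
    """Full-string match; '.' is any char, '*' repeats the preceding element.

    Assumes a well-formed pattern (no leading '*', no '**').
    """

    @lru_cache(maxsize=None)
    def go(i: int, j: int) -> bool:
        # go(i, j): do the first i chars of s match the first j chars of p?
        if j == 0:
            return i == 0
        if p[j - 1] == "*":
            # isMatch(Ax, By*) always includes isMatch(Ax, B): take zero copies.
            if go(i, j - 2):
                return True
            # isMatch(A, Bx*): consume one char of s, keep the starred element.
            return i > 0 and p[j - 2] in (s[i - 1], ".") and go(i - 1, j)
        # Plain character or '.': isMatch(Ax, Bx) = isMatch(A, B).
        return i > 0 and p[j - 1] in (s[i - 1], ".") and go(i - 1, j - 1)

    return go(len(s), len(p))
```

Note that the asterisk branch has only two alternatives, matching the answer's claim that the dp[i][j - 1] term is not needed.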
This assumes that the string itself will not contain the special characters '.' and '*', because otherwise the code presented won't work!
What does dp[i][j] represent?
It represents whether the first i characters of the string match the first j characters of the pattern.
State transition in case we encounter '*' in pattern:
Case 1: Take zero occurrences of the character preceding '*' in the pattern.
The ' * ' alone doesn't mean anything. It is dependent on its preceding character.
dp[i][j-2] will completely ignore the preceding character in pattern as it considers only first j-2 characters, so preceding character along with ' * ' (jth character) in pattern are ignored.
Now, if the ith character in the string and the character preceding '*' happen to be the same, or the character preceding '*' in the pattern is '.', then one more case applies.
Observation here: .* can match any string
If above condition is satisfied, then consider below case.
Case 2: Continuation of usage of character preceding ' * ' for one or more times.
dp[i-1][j] represents this. Remember jth character in pattern is ' * ' .
So if first i-1 characters of string match with first j characters in pattern where jth character is a ' * ' (which suggests that we have used character preceding ' * ' at least once), then we can say that first i characters in string match with first j characters of pattern as we have already ensured that ith character matches the preceding character to ' * ' in pattern.
The case dp[i][j-1] is redundant and covered by case 2.
Note: Explanation for redundancy of dp[i][j-1]
dp[i][j-1] uses the character preceding '*' exactly once. That is already covered by dp[i-1][j].
Reason:
We know the ith character matches the (j-1)th character of the pattern (remember, we checked this before considering this case).
So dp[i][j-1] = dp[i-1][j-2], which is already considered in the calculation of dp[i-1][j].
dp[i-1][j] is more powerful because dp[i-1][j] = dp[i-1][j-2] || dp[i-2][j], as the jth character is '*' (the second term applies only when the (i-1)th character also matches the character preceding '*'). This is what gives the memory property that allows us to use a character more than once.
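The redundancy argument can also be checked by brute force. The sketch below is a Python port of the table-filling idea with a hypothetical flag, use_redundant_term, that switches the dp[i][j-1] term on or off; redundant_term_never_matters is likewise a made-up helper. It assumes well-formed patterns (no leading '*', no '**') and strings without special characters.

```python
from itertools import product


def is_match_dp(s: str, p: str, use_redundant_term: bool = False) -> bool:
    # dp[i][j]: do the first i chars of s match the first j chars of p?
    dp = [[False] * (len(p) + 1) for _ in range(len(s) + 1)]
    dp[0][0] = True
    for j in range(2, len(p) + 1):
        if p[j - 1] == "*":
            dp[0][j] = dp[0][j - 2]
    for i in range(1, len(s) + 1):
        for j in range(1, len(p) + 1):
            if p[j - 1] == "*":
                dp[i][j] = dp[i][j - 2]  # Case 1: zero occurrences
                if p[j - 2] in (s[i - 1], "."):
                    dp[i][j] = dp[i][j] or dp[i - 1][j]  # Case 2: one more
                    if use_redundant_term:
                        dp[i][j] = dp[i][j] or dp[i][j - 1]
            elif p[j - 1] in (s[i - 1], "."):
                dp[i][j] = dp[i - 1][j - 1]
    return dp[len(s)][len(p)]


def redundant_term_never_matters(max_len: int = 4) -> bool:
    # Compare both variants on every short string and well-formed pattern.
    strings = ["".join(t) for n in range(max_len + 1) for t in product("ab", repeat=n)]
    patterns = [
        "".join(t) for n in range(max_len + 1) for t in product("ab.*", repeat=n)
    ]
    patterns = [p for p in patterns if not p.startswith("*") and "**" not in p]
    return all(
        is_match_dp(s, p) == is_match_dp(s, p, use_redundant_term=True)
        for s in strings
        for p in patterns
    )
```

An exhaustive check over short inputs is not a proof, but it agrees with the reasoning above: whenever dp[i][j-1] is true, dp[i-1][j] is true as well.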
Hey all, given the examples
1234567890
12 3456789 0
123 456-7890
12345 678 90
123-4-5-6789 0
The total number of digits is fixed, and any grouping is allowed as long as each group has at least some arbitrary length k (min_group_length), with a maximum number of groups being set (optional but preferred).
I need to identify these with a regex; my current solution is disgusting.
I first find the partitions of 10, then the permutations of them, then convert all that to regex, resulting in hundreds of groupings
printAllUniqueParts(10);
int min_groups = 1;
int min_group_len = 2;
res.RemoveAll(s => s.Split(' ').ToList().Intersect(Enumerable.Range(0, min_group_len).Select(n => n.ToString()).ToList()).Count() >= 1 || s.Split(' ').Length < min_groups || s.Split(' ').Length == 0);
string reg = string.Empty;
for (int i = 1; i < res.Count; i++)
{
    res[i] = res[i].Trim();
    var r = res[i].Split(' ');
    pair[] lp = r.Where(x => x.Length > 0).Select(y => new pair(y)).ToList().ToArray();
    var qw = new List<string[]>();
    perm(lp, 0, ref qw); // standard permutations List<string>
    for (int k = 0; k < qw.Count; k++)
    {
        string s = "";
        var v = string.Join(" ", qw[k]).Split(' ');
        for (int j = 0; j < v.Length; j++)
        {
            // Verbatim string, so the backslash reaches the regex engine as-is.
            s += @"\d{" + v[j] + "}" + (j == v.Length - 1 ? "" : "[ -]");
        }
        // res[i] = s;
        reg += '(' + s + ")" + (k == qw.Count - 1 ? "" : "|");
    }
}
This works, but there has to be a computationally cheaper way than the below.
Any help appreciated.
(\d{7}[ -]\d{3})|(\d{3}[ -]\d{7})|(\d{6}[ -]\d{4})|(\d{4}[ -]\d{6})|(\d{5}[ -]\d{5})|(\d{5}[ -]\d{5})|(\d{4}[ -]\d{3}[ -]\d{3})(\d{4}[ -]\d{3}[ -]\d{3})(\d{3}[ -]\d{4}[ -]\d{3})(\d{3}[ -]\d{3}[ -]\d{4})(\d{3}[ -]\d{4}[ -]\d{3})(\d{3}[ -]\d{3}[ -]\d{4})
I guess you could try this
^(?=\d(?:\D*\d){9}$)\d{1,7}(?:[ -]\d{1,7})*$
https://regex101.com/r/t9Lnw1/1
Explained
^ # BOS
(?= # Validate the 10 digits first
\d
(?: \D* \d ){9}
$
)
# Then match the string based on grouping of
\d{1,7} # no more than let's say 7 for example
(?: [ -] \d{1,7} )*
$ # EOS
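A small sketch of how this lookahead pattern could be parameterised and tried against the question's examples. The function name grouped_digits_pattern and the parameters total, min_len, max_len and max_groups are made up for illustration; the default cap of 7 digits per group simply mirrors the example above.

```python
import re


def grouped_digits_pattern(total=10, min_len=1, max_len=7, max_groups=None):
    """Build the lookahead-based pattern; all parameter names are illustrative."""
    group = rf"\d{{{min_len},{max_len}}}"
    # How many further groups may follow the first one.
    repeat = "*" if max_groups is None else f"{{0,{max_groups - 1}}}"
    return re.compile(
        rf"^(?=\d(?:\D*\d){{{total - 1}}}$)"  # exactly `total` digits overall
        rf"{group}(?:[ -]{group}){repeat}$"   # groups joined by one space or '-'
    )


if __name__ == "__main__":
    pattern = grouped_digits_pattern(max_len=10)
    for sample in ["1234567890", "12 3456789 0", "123 456-7890", "123 456 789"]:
        print(repr(sample), bool(pattern.match(sample)))
```

With the default cap of 7, the ungrouped 1234567890 is rejected because its single group is 10 digits long; raising max_len to the total accepts it.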
It sounds like you want sequences of at least K and at most 10 digits, but you also want to ignore any single - or space that might appear between two digits. So something like (\d[ -]?){K,10} should do the trick. Obviously the K needs to be replaced by an actual number, and this will incidentally pick up a trailing space or - after the sequence (which you likely just want to ignore anyways.)
If you really must avoid the trailing space or -, you could use \d([ -]?\d){K-1,9}
If you want some more complex structure than this, your best bet may be to use a simple regex that matches a superset of your requirements, and then post-process the matches to eliminate those that don't meet the details.
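The "match a superset, then post-process" idea might look like the following sketch. The names CANDIDATE and find_grouped_numbers, and the parameters min_group_len and max_groups, are assumptions for illustration; the lookarounds that stop a match from starting or ending inside a longer digit run are an addition not present in the answer.

```python
import re

# Ten digits with an optional single space or '-' between neighbours,
# not embedded in a longer run of digits.
CANDIDATE = re.compile(r"(?<!\d)\d(?:[ -]?\d){9}(?!\d)")


def find_grouped_numbers(text, min_group_len=2, max_groups=3):
    """Return candidates whose groups satisfy the length and count limits."""
    results = []
    for match in CANDIDATE.finditer(text):
        groups = re.split(r"[ -]", match.group())
        if len(groups) <= max_groups and all(len(g) >= min_group_len for g in groups):
            results.append(match.group())
    return results


if __name__ == "__main__":
    print(find_grouped_numbers("call 123 456-7890 or 12 3456789 0 or 1234567890"))
```

The regex stays short and the grouping rules live in ordinary code, so changing a limit does not require regenerating a pattern.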
I am trying to figure out a couple regular expressions for the below cases:
Lines with a length divisible by n, but not by m for integers n and m
Lines that do not contain a certain number n of a given character,
but may contain more or less
I am a newcomer and would appreciate any clarification on these.
I've used JavaScript for my examples.
For the first one the key is to note that 'multiples' are just repeats. So using /(...)+/ will match 3 characters and then repeat that match as many times as it can. Each matching group doesn't need to be the same set of 3 characters but they do need to be consecutive. Suitable anchoring using ^ and $ ensures you're checking the exact string length, and a negative lookahead, (?!...), can be used to negate.
e.g. Multiple of 5 but not 3:
/^(?!(.{3})+$)(.{5})+$/gm
Note that JavaScript uses / to mark the beginning and end of the expression, and gm are modifiers to perform global, multiline matches. I wasn't clear what you meant by matching 'lines', so I've assumed the string itself contains newline characters that must be taken into consideration. If you had, say, an array of lines and could check each one individually then things get slightly easier, or a lot easier in the case of your second question.
Demo for the first question:
var chars = '12345678901234567890',
str = '';
for (var i = 1 ; i <= chars.length ; ++i) {
    str += chars.slice(0, i) + '\n';
}
console.log('Full text is:');
console.log(str);
console.log('Lines with length that is a multiple of 2 but not 3:');
console.log(lineLength(str, 2, 3));
console.log('Lines with length that is a multiple of 3 but not 2:');
console.log(lineLength(str, 3, 2));
console.log('Lines with length that is a multiple of 5 but not 3:');
console.log(lineLength(str, 5, 3));
function lineLength(str, multiple, notMultiple) {
    return str.match(new RegExp('^(?!(.{' + notMultiple + '})+$)(.{' + multiple + '})+$', 'gm'));
}
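For comparison, here is the same idea as a Python sketch; lines_with_length is a made-up name, and the groups are non-capturing so that findall returns whole lines.

```python
import re


def lines_with_length(text, multiple, not_multiple):
    """Lines whose length is a multiple of `multiple` but not of `not_multiple`."""
    pattern = re.compile(
        rf"^(?!(?:.{{{not_multiple}}})+$)(?:.{{{multiple}}})+$", re.MULTILINE
    )
    return pattern.findall(text)


if __name__ == "__main__":
    text = "\n".join("x" * n for n in range(1, 21))
    print([len(line) for line in lines_with_length(text, 5, 3)])
```

The negative lookahead rejects a line before the main pattern runs, and because '.' does not match a newline, both checks stay within a single line.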
For the second question I couldn't come up with a nice way to do it. This horror show is what I ended up with. It wasn't too bad to match n occurrences of a particular character in a line but matching 'not n' proved difficult. I ended up matching {0,n-1} or {n+1,} but the whole thing doesn't feel so great to me. I suspect there's a cleverer way to do it that I'm currently not seeing.
var str = 'a\naa\naaa\naaaa\nab\nabab\nababab\nabababab\nba\nbab\nbaba\nbbabbabb';
console.log('Full string:');
console.log(str);
console.log('Lines with 1 occurrence of a:');
console.log(mOccurrences(str, 'a', 1));
console.log('Lines with 2 occurrences of a:');
console.log(mOccurrences(str, 'a', 2));
console.log('Lines with 3 occurrences of a:');
console.log(mOccurrences(str, 'a', 3));
console.log('Lines with not 1 occurrence of a:');
console.log(notMOccurrences(str, 'a', 1));
console.log('Lines with not 2 occurrences of a:');
console.log(notMOccurrences(str, 'a', 2));
console.log('Lines with not 3 occurrences of a:');
console.log(notMOccurrences(str, 'a', 3));
function mOccurrences(str, character, m) {
    return str.match(new RegExp('^[^' + character + '\n]*(' + character + '[^' + character + '\n]*){' + m + '}[^' + character + '\n]*$', 'gm'));
}
function notMOccurrences(str, character, m) {
    return str.match(new RegExp('^([^' + character + '\n]*(' + character + '[^' + character + '\n]*){0,' + (m - 1) + '}[^' + character + '\n]*|[^' + character + '\n]*(' + character + '[^' + character + '\n]*){' + (m + 1) + ',}[^' + character + '\n]*)$', 'gm'));
}
The key to how that works is that it tries to find n occurrences of a separated by sequences of [^a], with \n thrown in to stop it walking onto the next line.
In a real world scenario I would probably do the splitting into lines first as that makes things much easier. Counting the number of occurrences of a particular character is then just:
str.replace(/[^a]/g, '').length;
// Could use this instead, but note that in JS it fails when there are no matches, because match() then returns null
str.match(/a/g).length;
Again, this assumes a JavaScript environment. If you were using regexes in an environment where you could literally only pass in the regex as an argument then it's back to my earlier horror show.
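A Python sketch of both routes for "not exactly n occurrences". The first function splits into lines and counts, as recommended above. The second is a regex-only variant built on a negative lookahead; that variant is a suggestion of this sketch rather than something from the answer, and both function names are made up.

```python
import re


def lines_without_exact_count(text, char, n):
    """Split first, then count occurrences per line."""
    return [line for line in text.splitlines() if line.count(char) != n]


def lines_without_exact_count_regex(text, char, n):
    """Regex-only variant: reject lines that contain exactly n occurrences."""
    c = re.escape(char)
    pattern = re.compile(
        rf"^(?!(?:[^{c}\n]*{c}){{{n}}}[^{c}\n]*$).*$", re.MULTILINE
    )
    return pattern.findall(text)


if __name__ == "__main__":
    sample = "a\naa\naaa\naaaa\nab\nabab\nababab\nabababab\nba\nbab\nbaba\nbbabbabb"
    print(lines_without_exact_count(sample, "a", 2))
    print(lines_without_exact_count_regex(sample, "a", 2))
```

The lookahead describes "exactly n" and negating it covers both "fewer" and "more" at once, which avoids the two-branch alternation. The two functions can differ on input that ends with a newline, since the regex then also sees an empty final line.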
special character set: `~!##$%^&*()_-+={}[]\|:;""'<>,.?/
Is this the right way to search for items within that special character set?
Regex.IsMatch(Result.Text, "^[`-/]")
I always get an error.... Do I need to use ASCII codes? If so, how can I do this? Thank you in advance!
Regular expressions have a different escape syntax and rules than VB.NET. Since you're dealing with a regex string in your code, you have to make sure the string is escaped properly for regex and VB.NET.
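In your example, the - needs to be escaped with a backslash. Unescaped, it defines a range from ` to /, and because ` comes after / in character order, the range is reversed and the regex engine throws an error.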
Regex.IsMatch(Result.Text, "^[`\-/]")
To match any character in the provided string, try this...
Regex.IsMatch(Result.Text, "[`~!##\$%\^&\*\(\)_\-\+=\{\}\[\]\\\|:;""'<>,\.\?/]")
Try this:
[`~!##$%^&*()_+={}\[\]\\|:;""'<>,.?/-]
Working regex example:
http://regex101.com/r/aU3wO9
In a character set, only some characters need to be escaped, such as ] and \. If - is at the end, it doesn't need to be escaped. I'm not sure about [, so I escaped it anyway.
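The same escaping rules can be tried out in Python, whose character-class syntax is close to .NET's on these points. This is only a sketch: SPECIALS is based on the set in the question, and HAND_WRITTEN, GENERATED, has_special and compiles are names invented here.

```python
import re

# Based on the special character set from the question.
SPECIALS = "`~!#$%^&*()_-+={}[]\\|:;\"'<>,.?/"

# Hand-written class: '[', ']' and '\' are escaped, '-' sits last.
HAND_WRITTEN = re.compile(r"""[`~!#$%^&*()_+={}\[\]\\|:;"'<>,.?/-]""")

# Alternative: let the library escape every character.
GENERATED = re.compile("[" + re.escape(SPECIALS) + "]")


def has_special(text):
    """True if text contains at least one character from the set."""
    return HAND_WRITTEN.search(text) is not None


def compiles(pattern):
    """True if the pattern is accepted by the regex compiler."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


if __name__ == "__main__":
    print(has_special("abc123"), has_special("a-b"))
    # An unescaped '-' between ` and / is a reversed range, which is rejected.
    print(compiles("^[`-/]"), compiles(r"^[`\-/]"))
```

Building the class with an escaping helper avoids having to remember which characters are special inside brackets.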
You can insert an escape character into the string wherever there is a special character:
Dim str() As Byte
Dim j, n As Long
Dim ContainsSpecialChar As Boolean
Dim TempVendorName As String
str = Trim(VendorName)
n = 1
For j = 0 To UBound(str) - 1 Step 2
    If (str(j) > 32 And str(j) < 47) Or (str(j) > 57 And str(j) < 65) Or (str(j) > 90 And str(j) < 97) Or (str(j) > 122) Then
        ContainsSpecialChar = True
        TempVendorName = Left(VendorName, n - 1) + "\" + Mid(VendorName, n)
        n = n + 1
    End If
    n = n + 1
Next
I am trying to extract n 3-tuples (Si, Pi, Vi) from a string.
The string contains at least one such 3-tuple.
Pi and Vi are not mandatory.
SomeTextxyz#S1((property(P1)val(V1))#S2((property(P2)val(V2))#S3
|----------1-------------|----------2-------------|-- n
The desired output would be:
Si,Pi,Vi.
So for n occurrences in the string the output should look like this:
[S1,P1,V1] [S2,P2,V2] ... [Sn-1,Pn-1,Vn-1] (without the brackets)
Example
The input string could be something like this:
MyCarGarage#Mustang((property(PS)val(500))#Porsche((property(PS)val(425)).
Once processed the output should be:
Mustang,PS,500 Porsche,PS,425
Is there an efficient way to extract those 3-tuples using a regular expression
(e.g. using C++ and std::regex) and what would it look like?
#(.*?)\(\(property\((.*?)\)val\((.*?)\)\) should do the trick.
example at http://regex101.com/r/bD1rY2
# # Matches the # symbol
(.*?) # Captures everything until it encounters the next part (ungreedy wildcard)
\(\(property\( # Matches the string "((property(" the backslashes escape the parenthesis
(.*?) # Same as the one above
\)val\( # Matches the string ")val("
(.*?) # Same as the one above
\)\) # Matches the string "))"
How you should implement this in C++ I don't know, but that is the easy part :)
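As a quick check of the pattern outside C++, here is a Python sketch that runs it over the example string from the question. TUPLE_RE, extract_tuples and format_tuples are names chosen for illustration.

```python
import re

# The pattern from the answer above, unchanged.
TUPLE_RE = re.compile(r"#(.*?)\(\(property\((.*?)\)val\((.*?)\)\)")


def extract_tuples(text):
    """Return a list of (S, P, V) tuples found in text."""
    return TUPLE_RE.findall(text)


def format_tuples(text):
    """Render the tuples in the 'S,P,V S,P,V' form the question asks for."""
    return " ".join(",".join(t) for t in extract_tuples(text))


if __name__ == "__main__":
    sample = "MyCarGarage#Mustang((property(PS)val(500))#Porsche((property(PS)val(425))."
    print(format_tuples(sample))
```

Because property and val are mandatory in this pattern, an entry that has only a name is skipped rather than returned with empty fields.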
http://ideone.com/S7UQpA
I used C's <regex.h> instead of std::regex because std::regex isn't implemented in the g++ version that IDEONE uses (libstdc++ only gained a working <regex> in GCC 4.9). The regular expression I used:
" In C(++)? regexes are strings.
# Literal match
([^(#]+) As many non-#, non-( characters as possible. This is group 1
( Start another group (group 2)
\\(\\(property\\( Yet more literal matching
([^)]+) As many non-) characters as possible. Group 3.
\\)val\\( Literal again
([^)]+) As many non-) characters as possible. Group 4.
\\)\\) Literal parentheses
) Close group 2
? Group 2 optional
" Close Regex
And some C++ (really C-style code using POSIX <regex.h>):
int getMatches(char* haystack, item** items){
First, calculate the length of the string (we'll use that later) and the number of # found in the string (the maximum number of matches).
int l = -1, ats = 0;
while (haystack[++l])
if (haystack[l] == '#')
ats++;
malloc a large enough array.
*items = (item*) malloc(ats * sizeof(item));
item* arr = *items;
Make a regex needle to find. REGEX is #defined elsewhere.
regex_t needle;
regcomp(&needle, REGEX, REG_ICASE|REG_EXTENDED);
regmatch_t match[5];
ret will hold the return value (0 for "found a match", but there are other errors you may want to be catching here). x will be used to count the found matches.
int ret;
int x = -1;
Loop over matches (ret will be zero if a match is found).
while (!(ret = regexec(&needle, haystack, 5, match,0))){
++x;
Get the name from match[1].
int bufsize = match[1].rm_eo-match[1].rm_so + 1;
arr[x].name = (char *) malloc(bufsize);
strncpy(arr[x].name, &(haystack[match[1].rm_so]), bufsize - 1);
arr[x].name[bufsize-1]=0x0;
Check to make sure the property (match[3]) and the value (match[4]) were found.
if (!(match[3].rm_so > l || match[3].rm_so < 0 || match[3].rm_eo > l || match[3].rm_eo < 0
|| match[4].rm_so > l || match[4].rm_so < 0 || match[4].rm_eo > l || match[4].rm_eo < 0)){
Get the property from match[3].
bufsize = match[3].rm_eo-match[3].rm_so + 1;
arr[x].property = (char *) malloc(bufsize);
strncpy(arr[x].property, &(haystack[match[3].rm_so]), bufsize - 1);
arr[x].property[bufsize-1]=0x0;
Get the value from match[4].
bufsize = match[4].rm_eo-match[4].rm_so + 1;
arr[x].value = (char *) malloc(bufsize);
strncpy(arr[x].value, &(haystack[match[4].rm_so]), bufsize - 1);
arr[x].value[bufsize-1]=0x0;
} else {
Otherwise, set both property and value to NULL.
arr[x].property = NULL;
arr[x].value = NULL;
}
Move the haystack to past the match and decrement the known length.
haystack = &(haystack[match[0].rm_eo]);
l -= match[0].rm_eo;
}
Free the compiled regex and return the number of matches.
regfree(&needle);
return x+1;
}
Hope this helps. Though it occurs to me now that you never answered kind of a vital question: What have you tried?
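The optional-group pattern from this answer can be tried quickly in Python. In this sketch the outer group is made non-capturing, so the property and value land in groups 2 and 3 instead of 3 and 4; OPTIONAL_RE and extract_optional are illustrative names.

```python
import re

# Name, then an optional "((property(P)val(V))" part.
OPTIONAL_RE = re.compile(
    r"#([^(#]+)(?:\(\(property\(([^)]+)\)val\(([^)]+)\)\))?"
)


def extract_optional(text):
    """Return (name, property, value) tuples; missing parts are None."""
    return [m.groups() for m in OPTIONAL_RE.finditer(text)]


if __name__ == "__main__":
    sample = "SomeTextxyz#S1((property(P1)val(V1))#S2((property(P2)val(V2))#S3"
    for item in extract_optional(sample):
        print(item)
```

Unlike the stricter pattern, this one also returns the trailing entry that has a name but no property or value.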
I want to evaluate a math expression containing only addition, such as 1+2+3, and return its result. I have the following code, and the problem is that it doesn't handle doubles (I can't write 2.2+3.4).
I tried changing the regex to ([\+-]?\d+.\d+)([\+-])(-?(\d+.\d+)), and now it doesn't handle integers (I can't write 2+4). What is the correct regex to handle both doubles and integers? Thanks.
the code:
regEx = new Regex(@"([\+-]?\d+)([\+-])(-?(\d+))");
m = regEx.Match(Expression, 0);
while (m.Success)
{
    double result;
    switch (m.Groups[2].Value)
    {
        case "+":
            result = Convert.ToDouble(m.Groups[1].Value) + Convert.ToDouble(m.Groups[3].Value);
            if ((result < 0) || (m.Index == 0)) Expression = regEx.Replace(Expression, DoubleToString(result), 1);
            else Expression = regEx.Replace(Expression, "+" + result, 1);
            m = regEx.Match(Expression);
            continue;
        case "-":
            result = Convert.ToDouble(m.Groups[1].Value) - Convert.ToDouble(m.Groups[3].Value);
            if ((result < 0) || (m.Index == 0)) Expression = regEx.Replace(Expression, DoubleToString(result), 1);
            else Expression = regEx.Replace(Expression, "+" + result, 1);
            m = regEx.Match(Expression);
            continue;
    }
}
if (Expression.StartsWith("--")) Expression = Expression.Substring(2);
return Expression;
}
As the comments have stated, RegEx is not a good solution to this problem. You would be much better off with either a simple split statement (if you only want to support the + and - operators), or an actual parser (if you want to support actual mathematical expressions).
But, for the sake of explaining some RegEx, your problem is that \d+.\d+ matches "one or more digits, followed by any character, followed by one or more digits." If you gave it an integer greater than 99, it would work, since you're matching . (any character) and not \. (specifically the dot character).
A simpler version would be [\d\.]+, which matches one-or-more digits-or-dots. The problem is that it allows multiple dots, so 8.8.8.8 is a valid match. So what you really want is \d+\.?\d*, which matches one-or-more digits, one-or-zero dots, and zero-or-more digits. Thus 2, 2., and 2.05 are all valid matches.
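To see the number pattern at work, here is a small Python sketch that validates a +/- expression and sums its signed terms, roughly the "simple split" route suggested above. The names NUMBER, EXPRESSION, TERM and evaluate are made up, and unlike the asker's code this sketch does not accept doubled signs such as 1--2.

```python
import re

NUMBER = r"\d+\.?\d*"  # digits, an optional dot, optional further digits
EXPRESSION = re.compile(rf"[+-]?{NUMBER}(?:[+-]{NUMBER})*")
TERM = re.compile(rf"[+-]?{NUMBER}")


def evaluate(expression):
    """Sum the signed terms of an expression such as '2.2+3.4-1'."""
    expression = expression.replace(" ", "")
    if not EXPRESSION.fullmatch(expression):
        raise ValueError(f"unsupported expression: {expression!r}")
    return sum(float(term) for term in TERM.findall(expression))


if __name__ == "__main__":
    for text in ["1+2+3", "2+4", "10-2.5"]:
        print(text, "=", evaluate(text))
```

Validating the whole string first means inputs like 8.8.8.8 are rejected instead of being silently split into separate numbers.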