Iterative regex matching - c++

We've become fairly adept at generating various regular expressions to match input strings, but we've been asked to try to validate these strings iteratively. Is there an easy way to iteratively match the input string against a regular expression?
Take, for instance, the following regular expression:
[EW]\d{1,3}\.\d
When the user enters "E123.4", the regular expression is met. How do I validate the user's input while they type it? Can I partially match the string "E1" against the regular expression?
Is there some way to say that the input string only partially matched the input? Or is there a way to generate sub-expressions out of the master expression automatically based on string length?
I'm trying to create a generic function that can take any regular expression and throw an exception as soon as the user enters something that cannot meet the expression. Our expressions are rather simple in the grand scheme of things, and we are certainly not trying to parse HTML :)
Thanks in advance.
David

You could do it only by making every part of the regex optional, and repeating yourself:
^([EW]|[EW]\d{1,3}|[EW]\d{1,3}\.|[EW]\d{1,3}\.\d)$
This might work for simple expressions, but for complex ones this is hardly feasible.
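A minimal sketch of the same idea in JavaScript, written with nested optional groups rather than repeated alternatives. The helper name `classify` and the hand-derived `prefix` pattern are illustrative assumptions; the prefix pattern still has to be written by hand for each expression, which is why this does not scale to complex ones.

```javascript
// Full pattern from the question: [EW]\d{1,3}\.\d
var full = /^[EW]\d{1,3}\.\d$/;

// Hand-written prefix pattern (an assumption, derived manually):
// each later token is optional, but only once everything before it is present.
var prefix = /^[EW](?:\d{1,3}(?:\.\d?)?)?$/;

// Hypothetical helper: returns "full", "partial" or "invalid".
function classify(input) {
  if (full.test(input)) return "full";
  if (prefix.test(input)) return "partial";
  return "invalid";
}

console.log(classify("E"));      // partial
console.log(classify("E1"));     // partial
console.log(classify("E123."));  // partial
console.log(classify("E123.4")); // full
console.log(classify("EX3"));    // invalid
```

Like the alternation above, this rejects the empty string; wrap the whole prefix pattern in one more optional group if empty input should count as a valid start.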

Hard to say... If the user types an "E", that matches the beginning but not the rest. Of course, you don't know if they will continue to type "123.4" or if they will just hit "Enter" (I assume you use "Enter" to indicate the end of input) right away. You could use groups to test that all 3 groups match, such as:
([EW])(\d{1,3})(\.\d)
After the first character, try to match the first group. After the next few inputs, match the first AND second group, and when they enter the '.' and last digit you have to find a match for all 3 groups.

You could use partial matches if your regex lib supports it (as does Boost.Regex).
Adapting the is_possible_card_number example from the Boost.Regex partial-match documentation to the example in your question:
#include <boost/regex.hpp>
#include <cassert>
#include <stdexcept>
#include <string>

// Return false for partial match, true for full match, or throw for
// impossible match
bool
CheckPartialMatch(const std::string& Input, const boost::regex& Regex)
{
    boost::match_results<std::string::const_iterator> what;
    if(0 == boost::regex_match(Input, what, Regex, boost::match_default | boost::match_partial))
    {
        // the input so far could not possibly be valid so reject it:
        throw std::runtime_error(
            "Invalid data entered - this could not possibly be a match");
    }
    // OK so far so good, but have we finished?
    if(what[0].matched)
    {
        // excellent, we have a result:
        return true;
    }
    // what we have so far is only a partial match...
    return false;
}

int main()
{
    const boost::regex r("[EW]\\d{1,3}\\.\\d");
    // The input is incomplete, so we expect a "false" result
    assert(!CheckPartialMatch("E1", r));
    // The input completely satisfies the expression, so expect a "true" result
    assert(CheckPartialMatch("E123.4", r));
    try{
        // Input can't match the expression, so expect an exception.
        CheckPartialMatch("EX3", r);
        assert(false);
    }
    catch(const std::runtime_error&){
    }
    return 0;
}

Related

How can I allow my program to continue when a regex doesn't match?

I want to use the regex crate and capture numbers from a string.
let input = "abcd123efg";
let re = Regex::new(r"([0-9]+)").unwrap();
let cap = re.captures(input).unwrap().get(1).unwrap().as_str();
println!("{}", cap);
It worked if numbers exist in input, but if numbers don't exist in input I get the following error:
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value'
I want my program continue if the regex doesn't match. How can I handle this error?
You probably want to (re-)read the chapter on "Error Handling" in the Rust book. Error handling in Rust is mostly done via the types Result<T, E> and Option<T>, both representing an optional value of type T with Result<T, E> carrying additional information about the absence of the main value.
You are calling unwrap() on each Option or Result you encounter. unwrap() is a method saying: "if there is no value of type T, let the program explode (panic)". You only want to call unwrap() if an absence of a value is not expected and thus would be a bug! (NB: actually, the unwrap() in your second line is a perfectly reasonable use!)
But you use unwrap() incorrectly twice: on the result of captures() and on the result of get(1). Let's tackle captures() first; it returns an Option<_> and the docs say:
If no match is found, then None is returned.
In most cases, the input string not matching the regex is to be expected, thus we should deal with it. We could either just match the Option (the standard way to deal with those possible errors, see the Rust book chapter) or we could use Regex::is_match() before, to check if the string matches.
Next up: get(1). Again, the docs tell us:
Returns the match associated with the capture group at index i. If i does not correspond to a capture group, or if the capture group did not participate in the match, then None is returned.
But this time, we don't have to deal with that. Why? Our regex (([0-9]+)) is constant and we know that the capture group exists and encloses the whole regex. Thus we can rule out both possible situations that would lead to a None. This means we can unwrap(), because we don't expect the absence of a value.
The resulting code could look like this:
use regex::Regex;

fn main() {
    let input = "abcd123efg";
    let re = Regex::new(r"([0-9]+)").unwrap();
    match re.captures(input) {
        Some(caps) => {
            // Group 1 always exists in this regex, so unwrap() is fine here.
            let cap = caps.get(1).unwrap().as_str();
            println!("{}", cap);
        }
        None => {
            // The regex did not match. Deal with it here!
        }
    }
}
You can either check with is_match() first, or use the return value of captures() directly (it's an Option<Captures<'t>>) and handle it with a match instead of unwrapping it (see the Rust documentation on handling Option values).

+ is supposed to be greedy, so why am I getting a lazy result?

Why does the following regex return 101 instead of 1001?
console.log(new RegExp(/1(0+)1/).exec('101001')[0]);
I thought that + was greedy, so the longer of the two matches should be returned.
IMO this is different from Using javascript regexp to find the first AND longest match because I don't care about the first, just the longest. Can someone correct my definition of greedy? For example, what is the difference between the above snippet and the classic "oops, too greedy" example of new RegExp(/<(.+)>/).exec('<b>a</b>')[0] giving b>a</b?
(Note: This seems to be language-agnostic (it also happens in Perl), but just for ease of running it in-browser I've used JavaScript here.)
Regex always reads from left to right! It will not look for something longer. In the case of multiple matches, you have to re-execute the regex to get them, and compare their lengths yourself.
Greedy means the quantifier extends the match as far to the right as it can from the position where the match starts; it never means the longest match anywhere in the input string.
Regex itself is not the correct tool to extract the longest match. You might get all the substrings that match your pattern, and then pick the longest one using language-specific means.
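A short sketch of the distinction, using the two patterns from the question plus a lazy variant added here for contrast. The variable names are only illustrative.

```javascript
// The engine tries start positions from left to right and stops at the
// first position where the pattern can match at all.
var first = /1(0+)1/.exec("101001");
console.log(first[0], first.index); // "101" 0

// Greedy: once a start position works, .+ takes as much as it can.
var greedy = /<(.+)>/.exec("<b>a</b>");
console.log(greedy[1]); // "b>a</b"

// Lazy: .+? takes as little as it can from the same start position.
var lazy = /<(.+?)>/.exec("<b>a</b>");
console.log(lazy[1]); // "b"
```

In all three cases the match starts at index 0; greediness only decides where a match that starts there ends.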
Since the string is parsed from left to right, 101 will get matched in 101001 first, and the rest (001) will not match (as the 101 and 1001 matches are overlapping). You might use /(?=(10+1))./g and then check the length of each Group 1 value to get the longest one.
var regex = /(?=(10+1))./g;
var str = "101001";
var m, res = [];
while ((m = regex.exec(str)) !== null) {
    res.push(m[1]);
}
console.log(res); // => ["101", "1001"]
if (res.length > 0) {
    console.log("The longest match:", res.sort(function (a, b) { return b.length - a.length; })[0]);
} // => 1001

Google sheets custom function w/ regex fails on alternating rows when used in range

I just wrote up a simple google sheets function to fix some URLs. This function works fine in a browser, when passed the array of values manually. When called from google sheets, the function fails for every other row.
This isn't a problem with data, since I can make it work for the "failing" rows by moving the formula down one row, or calling it individually for each cell. I think this may be an issue with regex inside google sheets.
var pattern = /^http:\/\/(.*\/\d\/.*)_(.*)\/(g\d+p.*)$/ig;
function encode(input) {
    if (!input) return "";
    if (input.map) {
        return input.map(encode);
    } else {
        try {
            // same error happens, at this location, w/ or w/o toString()
            var matches = pattern.exec(input.toString());
            return matches[1] + encodeURIComponent(matches[2]) + matches[3];
        } catch (e) {
            return "error=" + e.message + " value = [" + input + "] ";
        }
    }
}
Edit: To make things clearer for those who come after, this also fails the same way when the regex is inside the "else" clause:
else {
var matches = /^(http:\/\/.*\/\d\/.*_)(.*)(\/g\d+p.*)$/ig.exec(input.toString());
... continues as normal
For alternating rows of the data, I get this error message:
error=Cannot read property "1" from null. value = [ http://... ]
I have tried:
Putting the regex inside the try{}
Putting the regex inside the encode{} function
Writing two separate functions (one for doing 1 value)
In the failure case I have data like this:
A1-A8 have URLs in them
B1 has the formula "=encode(A1:A8)"
Data in B1, B3, B5, B7 calculate perfectly
Data in B2, B4, B6, B8 error out (my error message shows up)
Moving the formula to cell "B2" and saying =encode(A2:A8) causes all the "failing" rows to calculate and the others to fail!
The short answer (as confirmed by your comment on the OP) is to remove the final "g" (the global flag) from the regex.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/exec
Syntax regexObj.exec(str)
Parameters str The string against which to match the regular expression.
...
If your regular expression uses the "g" flag, you can use the exec()
method multiple times to find successive matches in the same string.
When you do so, the search starts at the substring of str specified by
the regular expression's lastIndex property (test() will also advance
the lastIndex property).
So it seems you really should only include the global flag when you intend to continue to search for matches in the same string.
As to why it worked in other environments, I'm not sure. Indeed, it seems a bit strange to continue searching from where you left off, even though you are applying exec to an entirely new string. Perhaps the implementation in GAS is a little bit "off" - someone with more knowledge might be able to comment on this.
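For what it's worth, the alternating pattern is consistent with how lastIndex behaves in standard JavaScript: a successful exec() on a global regex leaves lastIndex at the end of the match, the next exec() starts from there even on a different string, and a failed exec() resets lastIndex to 0. The sketch below uses a short anchored pattern as a stand-in for the URL regex; the `rows` data and the `run` helper are made up for illustration.

```javascript
// Hypothetical helper: apply one regex object to each row in turn and
// record whether exec() found a match.
function run(regex, rows) {
  return rows.map(function (row) {
    return regex.exec(row) !== null;
  });
}

var rows = ["a1", "a2", "a3", "a4"];

// Global flag: lastIndex carries over from one row to the next.
var withGlobal = run(/^a\d$/g, rows);
console.log(withGlobal); // [ true, false, true, false ]

// No global flag: every exec() starts at index 0.
var withoutGlobal = run(/^a\d$/, rows);
console.log(withoutGlobal); // [ true, true, true, true ]
```

Moving the formula down one row shifts which rows see a non-zero lastIndex, which is why the "failing" rows swap.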
To elaborate on my comment, the error means that matches is empty or non-existent, which probably means that the regex did not find a match. So it is important to see whether the value of input should match or indeed does not conform to the requirements of the regex.
The regex does the following:
^http:\/\/(.*\/\d\/.*)_(.*)\/(g\d+p.*)$
Matched text:
http://whatever/3/some_thing/g4p/can be anything
^^^^^^^        ^^^    ^     ^^^^
So if any of the following is not found in the URL, no match will be returned:
URL does not start with http:// (but, for instance, https://)
There is no occurrence of: /, a number, /
There is no _
There is no occurrence of /g, some numbers, p
Are you sure the text meets all these requirements every time?

Need Regex Help

Can anybody help me with a regex? I have a string with digits like:
X024123099XYAAXX99RR
I need a regex to check whether a user has entered the correct information. The rule should also accept incomplete input, as long as it is valid when checked from left to right.
For example, when tested these inputs should return TRUE:
X024
X024123099X
X024123099XYA
X024123099XYAAXX99R
And these ones should return FALSE:
XX024
X02412AA99X
X024123099XYAAXX9911
And so on. The regex must check for the correct syntax, beginning from the left.
I have something like that, but this seems not to be correct:
\w\d{0,12}\w{0,6}\d{0,2}\w{0,2}
Big thanks for any help (I'm new to regex)
You could take OpenSauce's regex and then hack it to pieces to allow partial matches:
^[A-Z](\d{0,9}$|\d{9}([A-Z]{0,6}$|[A-Z]{6}(\d{0,2}$|\d{2}([A-Z]{0,2}$))))
It's not pretty but as far as I can tell it encodes your requirements.
Essentially I took each case of something like \d{9} and replaced it with something like (\d{0,9}$|\d{9}<rest of regex>).
I added ^ and $ because otherwise it will match substrings in an otherwise invalid string. For example, it will see an invalid string like XX024 and think it is okay because it contains X024.
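As a quick check, the nested pattern can be run over the sample inputs from the question. This is a sketch in JavaScript (the question does not name a language); the `samples` array and `results` object are illustrative names.

```javascript
// Partial-match pattern from the answer above, copied unchanged.
var partial = /^[A-Z](\d{0,9}$|\d{9}([A-Z]{0,6}$|[A-Z]{6}(\d{0,2}$|\d{2}([A-Z]{0,2}$))))/;

var samples = [
  "X024",
  "X024123099X",
  "X024123099XYA",
  "X024123099XYAAXX99R",
  "XX024",
  "X02412AA99X",
  "X024123099XYAAXX9911"
];

// Map each sample to true (valid so far) or false (can never be valid).
var results = {};
samples.forEach(function (s) {
  results[s] = partial.test(s);
  console.log(s, results[s] ? "OK" : "not OK");
});
```

The first four samples report OK and the last three report not OK, matching the expectations listed in the question.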
If I understand you correctly, your strings should match the regex
[A-Z]\d{9}[A-Z]{6}\d{2}[A-Z]{2}
but you also want to check if a string could be a prefix of a matching string, is that correct? You might be able to express this in a single regex, but I can't think of a way to do so that's easy to read.
You haven't said which language you're using, but if your language gives you a way to tell whether the end of the input string was reached while checking the regex, that would give you an easy way to get what you want. E.g. in Java, the method Matcher.hitEnd() tells you whether the end was reached, so the code below:
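import java.util.regex.Matcher;
import java.util.regex.Pattern;
import static java.lang.System.out;

// The class name is arbitrary; it only makes the snippet compile on its own.
public class PartialMatch {
    static Pattern pattern = Pattern.compile( "[A-Z]\\d{9}[A-Z]{6}\\d{2}[A-Z]{2}" );
    static Matcher matcher = pattern.matcher( "" );

    public static void main(String[] args) {
        String[] strings = {
            "X024",
            "X024123099X",
            "X024123099XYA",
            "X024123099XYAAXX99R",
            "XX024",
            "X02412AA99X",
            "X024123099XYAAXX9911"
        };
        for ( String string : strings ) {
            out.format( "%s %s\n", string, inputOK(string) ? "OK" : "not OK" );
        }
    }

    // OK if the input matches fully, or if the matcher ran out of input
    // before it could fail (more input could still produce a match).
    static boolean inputOK(String input) {
        return matcher.reset(input).matches() || matcher.hitEnd();
    }
}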
gives output:
X024 OK
X024123099X OK
X024123099XYA OK
X024123099XYAAXX99R OK
XX024 not OK
X02412AA99X not OK
X024123099XYAAXX9911 not OK

Regular expression to find two sets of 11 only

Hello guys I need to find a regular expression that takes strings with two sets of 11 only
from a set {0,1,2}
0011110000 match it only has two sets
0010001001100 does not match (only has one set)
0000011000110011 does not match (it has three sets)
00 does not match (it has no set)
0001100000110001 match it only has two sets
This is what I've done so far
([^1]|1(0|2|3)(0|2|3)*)*11([^1]|1(0|2|3)(0|2|3)*)*11([^1]|1(0|2|3)(0|2|3)*|1$)*
                                                    --------------------------
I think what I'm missing is that I need to make sure the underlined section of the above regular expression has to make sure there is no more "11" left in the string, and I don't think that section is working correctly.
You could use a regular expression, but you've got much simpler options available to you. Here's an example in C#:
public bool IsValidString(string input)
{
    return input.Split(new string[] { "11" }, StringSplitOptions.None).Length == 3;
}
Although regular expressions can be a very useful tool, their usage is not always warranted. As jwz put it:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
If this is not homework, then I would suggest avoiding a regex and going with a regular function (shown here is JavaScript):
function hasTwoElevensOnly(s) {
    var first = s.indexOf("11");
    if (first < 0) return false;
    var second = s.indexOf("11", first + 2);
    if (second < 0) return false;
    return s.indexOf("11", second + 2) < 0;
}
Code here: http://jsfiddle.net/8FMRH/
If a regex is required:
function hasTwoElevensOnly(s) {
    return /^((0|1(?!1)|2)*?11(0|1(?!1)|2)*?){2}$/.test(s);
}
Code here: http://jsfiddle.net/PAARn/1/
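The regex version above can be sanity-checked against the sample strings from the question. This sketch repeats the function so that it runs on its own; the `expected` table is just the question's examples restated, and `checks` is an illustrative name.

```javascript
// Regex version from the answer above, copied unchanged.
function hasTwoElevensOnly(s) {
  return /^((0|1(?!1)|2)*?11(0|1(?!1)|2)*?){2}$/.test(s);
}

// Sample strings from the question and the result the asker expects.
var expected = {
  "0011110000": true,        // two sets
  "0010001001100": false,    // one set
  "0000011000110011": false, // three sets
  "00": false,               // no set
  "0001100000110001": true   // two sets
};

// Compare the function's answer with the expected one for every sample.
var checks = Object.keys(expected).map(function (s) {
  var actual = hasTwoElevensOnly(s);
  console.log(s, actual, actual === expected[s] ? "as expected" : "MISMATCH");
  return actual === expected[s];
});
```

The negative lookahead 1(?!1) stops the filler from consuming the first 1 of any pair, so every 11 in the string has to be matched by one of the two literal 11s.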
Most regex flavors let you restrict the number of repetitions, usually with {}. For example, in JavaScript, you could do something like:
/^((10|0)*11(01|0)*){2}$/
This matches two sets of 11, each preceded and followed by zero or more 0s (isolated 1s are allowed; the digit 2 is not).
There may be a simpler way, but starting with your approach, this seems to work on the sample data provided:
/^([^1]|1[023])*11([^1]|1[023])*11((?<!11)|1[023]|[023]|(?<=[023])1)*$/
Using lookbehind.