Regex Extractor - c++

Hi, I am implementing regex matching in C++.
Background:
I have a std::string and a std::regex. I need to match the string against this regex.
The regex is not used for validation. A typical regex would be something like a[bc]{2}, and nothing beyond this scope.
I have to pass this regex as a char pointer argument to a function.
Problem:
I am unable to construct a std::regex from the char pointer. When I do so, I get the following error:
terminate called after throwing an instance of std::regex_error what(): regex_error
My function body is:
std::string s((char*)a); // The main string
std::regex e((char*)b); // Regex comparing the main string. a and b are the parameters to the function
if (std::regex_match(s, e))
{
// returns the matched portion of the string
// for instance "abcdeef" , "e{2}" would return ee
}
else
{
// return "Mismatch"
}
Any suggestions? Or is there a way to extract the string from a regex, like "a{2}b" -> "aab"?
Thanks in advance

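The default ECMAScript grammar does support bracket expressions such as [bc], so the pattern itself is valid. The error is more likely raised by the library: older standard library implementations (for example, libstdc++ before GCC 4.9) shipped an incomplete <regex> that throws std::regex_error on patterns like this. If you cannot upgrade the compiler, you can try the basic grammar flag instead (note that in the basic grammar the repetition is written \{2\}):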
std::regex e((char*)b, std::regex_constants::basic);
They are discussing more about this here.
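Separately from the exception, std::regex_match only succeeds when the whole string matches, so "abcdeef" against "e{2}" would be reported as a mismatch. A minimal sketch of the extraction described in the question, assuming a standard library with a working <regex>; the helper name extract_match is made up for illustration:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring of `text` that matches
// `pattern`, or "Mismatch" if nothing matches or the pattern is invalid.
std::string extract_match(const char* text, const char* pattern)
{
    try {
        const std::string s(text);
        const std::regex e(pattern);   // ECMAScript grammar by default
        std::smatch m;
        // regex_search looks for a match anywhere in s; regex_match would
        // require the entire string to match the pattern.
        if (std::regex_search(s, m, e)) {
            return m.str(0);           // the matched portion
        }
    } catch (const std::regex_error&) {
        // Thrown when the pattern cannot be compiled.
    }
    return "Mismatch";
}
```

With this sketch, extract_match("abcdeef", "e{2}") returns "ee", and a pattern that fails to compile is reported as "Mismatch" instead of terminating the program.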

Related

How to build a Raw string for regex from string variable

How do I build a regex from a string variable and have it interpreted as a raw string?
std::regex re{R"(pattern)"};
For the above code, is there a way to replace the fixed string "pattern" with a std::string pattern; variable that is built either at compile time or at run time?
I tried this, but it didn't work:
std::string pattern = "key";
std::string pattern = std::string("R(\"") + pattern + ")\"";
std::regex re(pattern); // does not behave the same as re(R"(key)")
Specifically, when using re(R"(key)") the result is found as expected. But when building with re(pattern), where pattern has exactly the same value ("key"), it did not find the result.
This is probably what I need, but it was for Java, not sure if there is anything similar in C++:
How do you use a variable in a regular expression?
std::string pattern = std::string("R(\"") + pattern + ")\"";
should be built from raw string literals as follows:
pattern = std::string(R"(\")") + pattern + std::string(R"(\")");
This results in a string value like
\"key\"
In case you want to have escaped parenthesis, you can write
pattern = std::string(R"(\(")") + pattern + std::string(R"("\))");
This results in a string value like
\("key"\)
Side note: You can't define the pattern variable twice. Omit the std::string type in follow up uses.
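If the goal is simply to use the variable's contents as the pattern, it may help to remember that R"(...)" is only source-code syntax: it changes how the compiler reads the literal, not what the resulting string holds. A small sketch under that assumption; the helper names same_characters and contains_pattern are made up for illustration:

```cpp
#include <regex>
#include <string>

// A raw string literal and an escaped ordinary literal produce the same
// characters; "raw" is not a property of the std::string object.
bool same_characters()
{
    return std::string(R"(\d+)") == std::string("\\d+");
}

// Hypothetical helper: build the regex directly from a runtime string.
bool contains_pattern(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern);   // no R"(...)" wrapping is needed here
    return std::regex_search(text, re);
}
```

Under this reading, std::string pattern = "key"; std::regex re(pattern); compiles the same expression as std::regex re(R"(key)");.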

How do I create a Regex from a user-provided string which contains regex metacharacters?

I need to create a regular expression using the regex crate which includes a string passed as a command line argument to the program. The command line argument can contain $ and {}.
If I hard code the string as r"...", then it works fine, but if I use the command line argument as format!(r#"{}"#, arg_str), I get the following error (assuming arg_str = ${replace}) :
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Syntax(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
${replace}
^
error: decimal literal empty
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
)', libcore/result.rs:945:5
note: Run with `RUST_BACKTRACE=1` for a backtrace.
Simplified code example to demonstrate this issue:
extern crate regex;
use regex::Regex;
fn main() {
let args: Vec<_> = std::env::args().collect();
let ref arg_str = args[1];
let re = Regex::new(format!(r#"{}"#, arg_str).as_str()).unwrap();
println!("{:?}", re);
}
If this is run with a simple argument like replace, there is no error, but if I pass it something like ${replace}, I get the error mentioned above.
The regex crate has a function escape which does what you need.
From the documentation:
Function regex::escape
pub fn escape(text: &str) -> String
Escapes all regular expression meta characters in text.
The string returned may be safely used as a literal in a regular expression.
So passing your arg_str through regex::escape should fix your problem.

Perl: Help writing a regular expression

I am trying to write a common regular expression for the below 3 cases:
Supernatural_S07E23_720p_HDTV_X264-DIMENSION.mkv
the.listener.313.480p.hdtv.x264-2hd.mkv
How.I.met.your.mother.s02e07.hdtv.x264-xor.avi
My regular expression should remove the series name from the original string, i.e. the output for the above strings should be:
S07E23_720p_HDTV_X264-DIMENSION.mkv
313.480p.hdtv.x264-2hd.mkv
s02e07.hdtv.x264-xor.avi
For the basic case of the Supernatural string I wrote the regex below, and it worked fine, but it fails as soon as the series name has multiple words.
$string =~ s/^(.*?)[\.\_\- ]//i; #delimiter can be (. - _ )
I have no idea how to proceed for the above cases. I was thinking along the lines of \w+{1,6}, but that also failed to do what is required.
PS: Explanation of what the regular expression is doing will be appreciated.
You can check whether the token after a . contains a digit; if it does not, consider it part of the name.
However, I personally think there is no perfect solution for this. It would still run into problems with something like:
24.313.480p.hdtv.x264-2hd.mkv // 24
Warehouse.13.s02e07.hdtv.x264-xor.avi // warehouse 13
As StanleyZ said, you'll always get into trouble with names containing numbers.
But if you set these special cases apart, you can try:
#perl
$\=$/;
map {
if (/^([\w\.]+)[\.\_]([SE\d]+[\.\_].*)$/i) {
print "Match : Name='$1' Suffix='$2'";
} else {
print "Did not match $_";
}
}
qw!
Supernatural_S07E23_720p_HDTV_X264-DIMENSION.mkv
the.listener.313.480p.hdtv.x264-2hd.mkv
How.I.met.your.mother.s02e07.hdtv.x264-xor.avi
!;
which outputs :
Match : Name='Supernatural' Suffix='S07E23_720p_HDTV_X264-DIMENSION.mkv'
Match : Name='the.listener' Suffix='313.480p.hdtv.x264-2hd.mkv'
Match : Name='How.I.met.your.mother' Suffix='s02e07.hdtv.x264-xor.avi'
note : aren't you doing something illegal ? ;)
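Since the question asks for an explanation of the expression, here is the same pattern ported to C++ std::regex with each part commented. This is only a sketch: the helper name split_series is made up, and it assumes the pattern behaves the same under the ECMAScript grammar with the icase flag.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: splits a file name into {series name, suffix}.
// Returns two empty strings if the pattern does not match.
std::pair<std::string, std::string> split_series(const std::string& file)
{
    // ^([\w.]+)   series name: word characters and dots, as many as possible
    // [._]        one delimiter between the name and the episode token
    // ([SE\d]+    episode token made only of S, E and digits (S07E23, 313)
    //  [._].*)$   followed by a delimiter and the rest of the file name
    static const std::regex re(R"(^([\w.]+)[._]([SE\d]+[._].*)$)",
                               std::regex::icase);
    std::smatch m;
    if (std::regex_match(file, m, re)) {
        return {m.str(1), m.str(2)};
    }
    return {"", ""};
}
```

The first group is greedy, so the engine backtracks from the right until the next token consists only of S, E and digits. That is why the name keeps all of its dotted words.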

Need Regex Help

Can anybody help me with a regex? I have a string with digits like:
X024123099XYAAXX99RR
I need a regex to check whether a user has entered the correct information. The rule should also accept partial input: the input is checked from left to right, so any valid prefix should pass.
For example, when tested these inputs should return TRUE:
X024
X024123099X
X024123099XYA
X024123099XYAAXX99R
And these ones should return FALSE:
XX024
X02412AA99X
X024123099XYAAXX9911
And so on. The regex must check for the correct syntax, beginning from the left.
I have something like that, but this seems not to be correct:
\w\d{0,12}\w{0,6}\d{0,2}\w{0,2}
Big thanks for any help (I'm new to regex)
You could take OpenSauce's regex and then hack it to pieces to allow partial matches:
^[A-Z](\d{0,9}$|\d{9}([A-Z]{0,6}$|[A-Z]{6}(\d{0,2}$|\d{2}([A-Z]{0,2}$))))
It's not pretty but as far as I can tell it encodes your requirements.
Essentially I took each case of something like \d{9} and replaced it with something like (\d{0,9}$|\d{9}<rest of regex>).
I added ^ and $ because otherwise it will match substrings in an otherwise invalid string. For example, it will see an invalid string like XX024 and think it is okay because it contains X024.
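To check the nested expression against the sample inputs from the question, a short C++ sketch can be used. The language was not specified in the question, so std::regex is an assumption here, and the helper name is_valid_prefix is made up for illustration:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `input` is a complete string of the form
// letter, 9 digits, 6 letters, 2 digits, 2 letters, or a prefix of one.
bool is_valid_prefix(const std::string& input)
{
    static const std::regex re(
        R"(^[A-Z](\d{0,9}$|\d{9}([A-Z]{0,6}$|[A-Z]{6}(\d{0,2}$|\d{2}([A-Z]{0,2}$)))))");
    // The pattern carries its own ^ and $ anchors, so regex_search is enough.
    return std::regex_search(input, re);
}
```

Each nesting level either lets the input end inside the current block (the alternative ending in $) or requires the block to be complete before moving on to the next one.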
If I understand you correctly, your strings should match the regex
[A-Z]\d{9}[A-Z]{6}\d{2}[A-Z]{2}
but you also want to check if a string could be a prefix of a matching string, is that correct? You might be able to express this in a single regex, but I can't think of a way to do so that's easy to read.
You haven't said which language you're using, but if your language gives you a way to tell whether the end of the input string was reached while checking the regex, that would give you an easy way to get what you want. E.g. in Java, the method Matcher.hitEnd tells you whether the end was reached, so the code below:
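import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is illustrative; the original snippet omitted the enclosing class.
public class PrefixCheck {
    static Pattern pattern = Pattern.compile( "[A-Z]\\d{9}[A-Z]{6}\\d{2}[A-Z]{2}" );
    static Matcher matcher = pattern.matcher( "" );

    public static void main(String[] args) {
        String[] strings = {
            "X024",
            "X024123099X",
            "X024123099XYA",
            "X024123099XYAAXX99R",
            "XX024",
            "X02412AA99X",
            "X024123099XYAAXX9911"
        };
        for ( String string : strings ) {
            System.out.format( "%s %s\n", string, inputOK(string) ? "OK" : "not OK" );
        }
    }

    // OK if the input matches fully, or if the matcher ran out of input
    // before it could fail (the input is a prefix of a possible match).
    static boolean inputOK(String input) {
        return matcher.reset(input).matches() || matcher.hitEnd();
    }
}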
gives output:
X024 OK
X024123099X OK
X024123099XYA OK
X024123099XYAAXX99R OK
XX024 not OK
X02412AA99X not OK
X024123099XYAAXX9911 not OK

Iterative regex matching

We've become fairly adept at generating various regular expressions to match input strings, but we've been asked to try to validate these strings iteratively. Is there an easy way to iteratively match the input string against a regular expression?
Take, for instance, the following regular expression:
[EW]\d{1,3}\.\d
When the user enters "E123.4", the regular expression is met. How do I validate the user's input while they type it? Can I partially match the string "E1" against the regular expression?
Is there some way to say that the input string only partially matched the input? Or is there a way to generate sub-expressions out of the master expression automatically based on string length?
I'm trying to create a generic function that can take any regular expression and throw an exception as soon as the user enters something that cannot meet the expression. Our expressions are rather simple in the grand scheme of things, and we are certainly not trying to parse HTML :)
Thanks in advance.
David
You could do it only by making every part of the regex optional, and repeating yourself:
^([EW]|[EW]\d{1,3}|[EW]\d{1,3}\.|[EW]\d{1,3}\.\d)$
This might work for simple expressions, but for complex ones this is hardly feasible.
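For the expression in the question, the hand-expanded alternation can be tried out with a few lines of C++. This is a sketch using std::regex only; the helper name could_still_match is made up for illustration:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `input` is a complete match of
// [EW]\d{1,3}\.\d or a prefix that could still grow into one.
bool could_still_match(const std::string& input)
{
    static const std::regex partial(
        R"(^([EW]|[EW]\d{1,3}|[EW]\d{1,3}\.|[EW]\d{1,3}\.\d)$)");
    return std::regex_search(input, partial);
}
```

Here "E", "E1", "E123." and "E123.4" are accepted, while "EX3" and "E1234" are rejected. The empty string is also rejected, so input validation would have to start after the first keystroke.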
Hard to say... If the user types an "E", that matches the beginning but not the rest. Of course, you don't know whether they will continue to type "123.4" or just hit "Enter" right away (I assume you use "Enter" to indicate the end of input). You could use groups to test that all 3 groups match, such as:
([EW])(\d{1,3})(\.\d)
After the first character, try to match the first group. After the next few inputs, match the first AND second group, and when they enter the '.' and last digit you have to find a match for all 3 groups.
You could use partial matches if your regex lib supports it (as does Boost.Regex).
Adapting the is_possible_card_number example on this page to the example in your question:
#include <boost/regex.hpp>
#include <cassert>
#include <stdexcept>
#include <string>
// Return false for partial match, true for full match, or throw for
// impossible match
bool
CheckPartialMatch(const std::string& Input, const boost::regex& Regex)
{
boost::match_results<std::string::const_iterator> what;
if(0 == boost::regex_match(Input, what, Regex, boost::match_default | boost::match_partial))
{
// the input so far could not possibly be valid so reject it:
throw std::runtime_error(
"Invalid data entered - this could not possibly be a match");
}
// OK so far so good, but have we finished?
if(what[0].matched)
{
// excellent, we have a result:
return true;
}
// what we have so far is only a partial match...
return false;
}
int main()
{
const boost::regex r("[EW]\\d{1,3}\\.\\d");
// The input is incomplete, so we expect a "false" result
assert(!CheckPartialMatch("E1", r));
// The input completely satisfies the expression, so expect a "true" result
assert(CheckPartialMatch("E123.4", r));
try{
// Input can't match the expression, so expect an exception.
CheckPartialMatch("EX3", r);
assert(false);
}
catch(const std::runtime_error&){
}
return 0;
}