Ignore String containing special words (Months) - c++

I am trying to find alphanumeric strings by using the following regular expression:
^(?=.*\d)(?=.*[a-zA-Z]).{3,90}$
Alphanumeric string: any string that contains at least one digit and one letter, and that may also contain special characters such as # - _ [] () {} ç \ ù %
I want to add an extra constraint to ignore all alphanumerical strings containing the following month formats :
JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre
One solution is to actually match an alphanumerical string. Then check if this string contains one of these names by using the following function:
vector<string> findString(string s)
{
vector<string> vec;
boost::regex rgx("JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre");
boost::smatch match;
boost::sregex_iterator begin {s.begin(), s.end(), rgx},
end {};
for (boost::sregex_iterator& i = begin; i != end; ++i)
{
boost::smatch m = *i;
vec.push_back(m.str());
}
return vec;
}
Question: How can I add this constraint directly into the regular expression instead of using this function?
One solution is to use negative lookahead as mentioned in How to ignore words in string using Regular Expressions.
I used it as follows:
String : 2-hello-001
Regular expression : ^(?=.*\d)(?=.*[a-zA-Z]^(?!Jan|Feb|Mar)).{3,90}$
Result: no match
Test website: http://regexlib.com/
The edit provided by @Robin and @RyanCarlson : ^[][\w#_(){}ç\\ù%-]{3,90}$ works perfectly in detecting alphanumeric strings with special characters. It's just the negative lookahead part that isn't working.

You can use negative look ahead, the same way you're using positive lookahead:
^(?=.*\d)(?=.*[a-zA-Z])
(?!.*(?:JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre)).{3,90}$
Also, your regex is pretty unclear. If you want alphanumeric strings with a length between 3 and 90, you can just do:
/^(?!.*(?:JANVIER|F[Eé]VRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AO[Uû]T|SEPTEMBRE|OCTOBRE|NOVEMBRE|D[Eé]CEMBRE|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))
[][\w#_(){}ç\\ù%-]{3,90}$/i
The i flag makes the match case-insensitive (so you can shorten your forbidden list), and \w is a shortcut for [0-9a-zA-Z_]. Careful if you copy-paste: the line break between (?! ) and [ ] is only there for readability. Just add whatever other special characters you want to match to the final [...].
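As a rough illustration of the lookahead approach in C++, here is a minimal sketch that assumes std::regex with its default ECMAScript grammar instead of Boost. The function name is hypothetical, and the month list is cut down to unaccented spellings, because std::regex on std::string works on single bytes and may not treat accented UTF-8 letters as one character.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `s` is 3 to 90 characters long, contains at
// least one digit and one letter, and contains none of the month names below.
bool is_alnum_without_month(const std::string& s) {
    static const std::regex rgx(
        "^(?=.*\\d)(?=.*[a-zA-Z])"                       // needs a digit and a letter
        "(?!.*(?:janvier|fevrier|mars|avril|mai|juin|"    // rejects any month name
        "juillet|aout|septembre|octobre|novembre|decembre|"
        "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec))"
        ".{3,90}$",
        std::regex::icase);                               // case-insensitive list
    return std::regex_match(s, rgx);
}
```

With this sketch, is_alnum_without_month("2-hello-001") is true, while "12-Jan-2020" is rejected because of the month name. Note that the check is a plain substring test, so a string such as "2-marker" is rejected as well.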

Related

Why does the regex [a-zA-Z]{5} return true for non-matching string?

I defined a regular expression to check if the string only contains alphabetic characters and with length 5:
use regex::Regex;
fn main() {
let re = Regex::new("[a-zA-Z]{5}").unwrap();
println!("{}", re.is_match("this-shouldn't-return-true#"));
}
The text I use contains many illegal characters and is longer than 5 characters, so why does this return true?
You have to put it inside ^...$ to match the whole string and not just parts:
use regex::Regex;
fn main() {
let re = Regex::new("^[a-zA-Z]{5}$").unwrap();
println!("{}", re.is_match("this-shouldn't-return-true#"));
}
As explained in the docs:
Notice the use of the ^ and $ anchors. In this crate, every expression is executed with an implicit .*? at the beginning and end, which allows it to match anywhere in the text. Anchors can be used to ensure that the full text matches an expression.
Your pattern returns true because it matches any run of 5 consecutive letters anywhere in the string; in your case it can match "shoul" in "shouldn't" and "retur" in "return".
Change your regex to: ^[a-zA-Z]{5}$
^ start of string
[a-zA-Z]{5} matches 5 alpha chars
$ end of string
This will match a string only if the string has a length of 5 chars and all of the chars from start to end fall in range a-z and A-Z.
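The question above is about Rust, but the same anchoring idea carries over to other engines. A minimal C++ sketch, assuming std::regex with the default ECMAScript grammar (both function names are hypothetical):

```cpp
#include <regex>
#include <string>

// Unanchored: true if ANY run of five ASCII letters occurs somewhere in `s`.
bool contains_five_letters(const std::string& s) {
    static const std::regex re("[a-zA-Z]{5}");
    return std::regex_search(s, re);
}

// Anchored with ^ and $: true only if the WHOLE string is five ASCII letters.
bool is_exactly_five_letters(const std::string& s) {
    static const std::regex re("^[a-zA-Z]{5}$");
    return std::regex_search(s, re);
}
```

For "this-shouldn't-return-true#" the unanchored version returns true and the anchored one returns false. In C++, calling std::regex_match instead of std::regex_search is another way to require a whole-string match.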

How to find the exact substring with regex in c++11?

I am trying to find substrings that are not surrounded by other a-zA-Z0-9 symbols.
For example: I want to find substring hello, so it won't match hello1 or hellow but will match Hello and heLLo!##$%.
Here is my sample:
std::string s = "1mySymbol1, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]*" + sub + "[^a-zA-Z0-9]*", std::regex::icase);
std::smatch match;
while (std::regex_search(s, match, rgx)) {
std::cout << match.size() << "match: " << match[0] << '\n';
s = match.suffix();
}
The result is:
1match: mySymbol
1match: , /_mySymbol_
1match: mysymbol
But I don't understand why the first occurrence, 1mySymbol1, also matches my regex.
How can I create a proper regex that will ignore such strings?
UPD
If I do it like this:
std::string s = "mySymbol, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]+" + sub + "[^a-zA-Z0-9]+", std::regex::icase);
then I find only substring in the middle
1match: , /_mySymbol_
and I don't find the substrings at the beginning and at the end.
The regex [^a-zA-Z0-9]* will match 0 or more characters, so it's perfectly valid for [^a-zA-Z0-9]*mysymbol[^a-zA-Z0-9]* to match mysymbol in 1mySymbol1 (allowing for case insensitivity). As you saw, this is fixed when you use [^a-zA-Z0-9]+ (matching 1 or more characters) instead.
With your update, you see that this doesn't match strings at the beginning or end. That's because [^a-zA-Z0-9]+ has to match 1 or more characters (which don't exist at the beginning or end of the string).
You have a few options:
Use beginning/end anchors: (?:[^a-zA-Z0-9]+|^)mysymbol(?:[^a-zA-Z0-9]+|$) (non-alphanumeric OR beginning of string, followed by mysymbol, followed by non-alphanumeric OR end of string).
Use negative lookahead and negative lookbehind: (?<![a-zA-Z0-9])mysymbol(?![a-zA-Z0-9]) (match mysymbol which doesn't have an alphanumeric character before or after it). Note that with this pattern the match won't include the characters before/after mysymbol. Also note that std::regex's default ECMAScript grammar supports lookahead but not lookbehind, so this form needs an engine such as Boost.Regex or PCRE.
I recommend using https://regex101.com/ to play around with regular expressions. It lists all the different constructs you can use.
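Because std::regex has no lookbehind, one way to combine the two options in plain C++11 is to use the "start of string OR non-alphanumeric" alternation on the left and a negative lookahead on the right, and to put the word itself in a capture group. The sketch below is only an illustration: find_standalone is a hypothetical name, and it assumes the word contains no regex metacharacters.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every occurrence of `word` (case-insensitive)
// that has no ASCII letter or digit directly before or after it.
// Assumes `word` contains no regex metacharacters.
std::vector<std::string> find_standalone(const std::string& text,
                                         const std::string& word) {
    const std::regex rgx("(?:^|[^a-zA-Z0-9])(" + word + ")(?![a-zA-Z0-9])",
                         std::regex::icase);
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), rgx), end;
         it != end; ++it) {
        found.push_back((*it)[1].str());  // group 1 is the word without its neighbour
    }
    return found;
}
```

The lookahead on the right does not consume the following character, so two occurrences separated by a single space are both found. On the first sample string from the question, 1mySymbol1 is skipped and the other two occurrences are returned.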

C++11 regex to tokenize Mathematical Expression

I have the following code to tokenize a string of the format: (1+2)/((8))-(100*34):
I'd like to throw an error to the user if they use an operator or character that isn't part of my regex.
e.g if user enters 3^4 or x-6
Is there a way to negate my regex, search for it and if it is true throw the error?
Can the regex expression be improved?
//Using c++11 regex to tokenize input string
//[0-9]+ = 1 or many digits
//Or [\\-\\+\\(\\)\\/\\*] = "-" or "+" or "/" or "*" or "(" or ")"
std::regex e ( "[0-9]+|[\\-\\+\\(\\)\\/\\*]");
std::sregex_iterator rend;
std::sregex_iterator a( infixExpression.begin(), infixExpression.end(), e );
queue<string> infixQueue;
while (a!=rend) {
infixQueue.push(a->str());
++a;
}
return infixQueue;
-Thanks
You can search the string with the expression [^0-9()+\-*/], written as the C++ string "[^0-9()+\\-*/]". It finds any character that is NOT a digit, a round bracket, a plus or minus sign (a plain hyphen), an asterisk or a slash.
This search should not find anything; if it does, the string contains an unsupported character such as ^ or x.
[...] is a positive character class: it matches one character that is in the list between the square brackets.
[^...] is a negative character class: it matches one character that is NOT in the list between the square brackets.
The only characters that must be escaped within square brackets to be taken literally are ], \ and - (plus ^ when it is the first character); - does not need to be escaped if it is the first or last character in the list. It is nevertheless better to always escape - within square brackets, because that makes it unambiguous that the hyphen is a literal character and not a range operator ("from x to z").
Of course, this expression does not check for missing closing round brackets. Formula parsers, unlike compilers or script interpreters, often do not require a closing parenthesis for every opening one, simply because it is not needed to calculate the value of the entered formula.
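A minimal sketch of that check, assuming std::regex with the default ECMAScript grammar; the function name has_unsupported_char is hypothetical:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `expr` contains any character that is not a
// digit, a round bracket, or one of the operators + - * /.
// Whitespace is also reported as unsupported by this pattern.
bool has_unsupported_char(const std::string& expr) {
    static const std::regex unsupported("[^0-9()+\\-*/]");
    return std::regex_search(expr, unsupported);
}
```

A caller could run this before tokenizing and throw, for example, std::invalid_argument when it returns true. For "(1+2)/((8))-(100*34)" it returns false; for "3^4" and "x-6" it returns true.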
An answer has already been given, but perhaps someone might need this:
[0-9]?([0-9]*[.])?[0-9]+|[\\-\\+\\(\\)\\/\\*]
This regex separates floats, integers and arithmetic operators.
Here's the trick:
[0-9]?([0-9]*[.])?[0-9]+ -> if the number has a decimal point, grab the digits before the point, the point itself and the digits that follow it; if not, just grab the digits.
Sorry if my answer isn't clear; I just learned regex and found this solution on my own by trial and error.
Here's the code (it takes a mathematical expression and splits all numbers and operators into a vector).
NOTE: The expressions I worked with had no whitespace, for example 4+2*(3+1), and they were separated nicely. I haven't tried it with whitespace.
/* Separate every int, float or operator into its own string using a regular expression and store it in the untokenize vector */
string infix; //The string to be parsed (the arithmetic expression, if you will)
vector<string> untokenize;
std::regex words_regex("[0-9]?([0-9]*[.])?[0-9]+|[\\-\\+\\(\\)\\/\\*]");
auto words_begin = std::sregex_iterator(infix.begin(), infix.end(), words_regex);
auto words_end = std::sregex_iterator();
for (std::sregex_iterator i = words_begin; i != words_end; ++i) {
cout << (*i).str() << endl;
untokenize.push_back((*i).str());
}
Output:
(
1
+
2
)
/
(
(
8
)
)
-
(
100
*
34
)
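For comparison, here is a self-contained sketch of the same tokenizing idea wrapped in a function. It is not the answer's exact pattern: it assumes a slightly simplified number pattern (digits, optionally followed by a point and more digits), and the name tokenize is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits an infix expression into number and operator
// tokens. Characters that match neither alternative (such as spaces) are
// silently skipped, so unsupported input should be validated separately.
std::vector<std::string> tokenize(const std::string& infix) {
    static const std::regex token("[0-9]+(?:[.][0-9]+)?|[()+\\-*/]");
    std::vector<std::string> tokens;
    for (std::sregex_iterator it(infix.begin(), infix.end(), token), end;
         it != end; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

Called with "(1+2)/((8))-(100*34)", it yields the same 17 tokens as the output listed above. Because unmatched characters are skipped rather than reported, "4 + 2.5" yields 4, + and 2.5.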

How to validate a string to have only certain letters by perl and regex

I am looking for a perl regex which will validate a string containing only the letters ACGT. For example "AACGGGTTA" should be valid while "AAYYGGTTA" should be invalid, since the second string has "YY" which is not one of A,C,G,T letters. I have the following code, but it validates both the above strings
if($userinput =~/[A|C|G|T]/i)
{
$validEntry = 1;
print "Valid\n";
}
Thanks
Use a character class, and make sure you check the whole string by using the start of string token, \A, and end of string token, \z.
You should also use * or + to indicate how many characters you want to match -- * means "zero or more" and + means "one or more."
Thus, the regex below is saying "between the start and the end of the (case insensitive) string, there should be one or more of the following characters only: a, c, g, t"
if($userinput =~ /\A[acgt]+\z/i)
{
$validEntry = 1;
print "Valid\n";
}
Using the character-counting tr operator:
if( $userinput !~ tr/ACGT//c )
{
$validEntry = 1;
print "Valid\n";
}
tr/characterset// counts how many characters in the string are in characterset; with the /c flag, it counts how many are not in the characterset. Using !~ instead of =~ negates the result, so it will be true if there are no characters not in characterset or false if there are characters not in characterset.
Your character class [A|C|G|T] contains |. | does not stand for alternation in a character class, it only stands for itself. Therefore, the character class would include the | character, which is not what you want.
Your pattern is not anchored. The pattern /[ACGT]+/ would match any string that contains one or more of any of those characters. Instead, you need to anchor your pattern, so that only strings that contain just those characters from beginning to end are matched.
$ can match a newline. To avoid that, use \z to anchor at the end. \A anchors at the beginning (although it doesn't make a difference whether you use that or ^ in this case, using \A provides a nice symmetry).
So, your check should be written:
if ($userinput =~ /\A [ACGT]+ \z/ix)
{
$validEntry = 1;
print "Valid\n";
}
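The two Perl approaches above (an anchored character class and character counting with tr) have close analogues in other languages. The following is a C++ sketch for comparison only, assuming std::regex; both function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <regex>
#include <string>

// Counting approach, similar in spirit to tr/ACGT//c: number of characters
// that are NOT one of A, C, G, T (case-sensitive, like the tr example).
std::size_t count_non_acgt(const std::string& s) {
    return static_cast<std::size_t>(
        std::count_if(s.begin(), s.end(), [](char c) {
            return c != 'A' && c != 'C' && c != 'G' && c != 'T';
        }));
}

// Regex approach: std::regex_match must cover the whole string, which plays
// the role of the \A ... \z anchors; icase mirrors the /i flag.
bool is_dna(const std::string& s) {
    static const std::regex re("[ACGT]+", std::regex::icase);
    return std::regex_match(s, re);
}
```

As with the Perl versions, the two checks differ on edge cases: the counting check treats an empty string as having no invalid characters, while the + quantifier in the regex rejects it.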

Contextual Regular Expression

I have a list of comma separated words that I want to remove the comma from and replace with a space:
elements-(a,b,c,d)
becomes:
elements-(a b c d)
The question is how I can do this with a regular expression if and only if that list is within a specific context, e.g. only when it is prefixed by elements-:
The following:
There are a number of elements-(a,b,c,d) and a number of other elements-(e,f,g,h)
should become:
There are a number of elements-(a b c d) and a number of other elements-(e f g h)
What would be the correct way to do this with regex?
For contextual regular expressions, you can use zero-width look-around assertions. Look-around assertions are used to assert that something must be true in order for the match to succeed, but they do not consume any characters (hence "zero-width").
In your case, you want to use positive look-behind and look-ahead assertions. In C#, you can do the following:
static string Replace(string text)
{
return Regex.Replace(
text,
@"(?<=elements\-\((\w+,)*)(\w+),(?=(\w+,)*\w+\))",
"$2 "
);
}
There are three basic parts to the pattern here (in order):
(?<=elements\-\((\w+,)*) - this is the positive look-behind assertion. It says that the pattern will only match if it is preceded by the text elements-( and zero-or-more comma-separated strings.
(\w+), - this is the actual match. It's the text that's being replaced.
(?=(\w+,)*\w+\)) - this is the positive look-ahead assertion. It says that the pattern will only match if it is followed by one-or-more comma-separated strings and a closing parenthesis.
In C#, for matching the inner comma-separated contents, you can alternatively do the following:
static string Replace(string text)
{
return Regex.Replace(
text,
@"(?<=elements\-)\(((\w+,)+\w+)\)",
m => string.Format("({0})", m.Groups[1].Value.Replace(',', ' '))
);
}
The basic approach with the positive look-behind assertion is still the same.
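For engines without look-behind, such as C++ std::regex, a similar result can be reached by matching the elements-( prefix as part of the pattern and rewriting only the captured list. This is a sketch under that assumption, not a translation of the C# API; the function name is hypothetical.

```cpp
#include <algorithm>
#include <regex>
#include <string>

// Hypothetical helper: replaces commas with spaces, but only inside lists of
// the form elements-(a,b,...). Other parenthesised lists are left untouched.
std::string replace_commas_in_elements(const std::string& text) {
    static const std::regex rgx("elements-\\(((?:\\w+,)+\\w+)\\)");
    std::string out;
    auto last = text.cbegin();
    for (std::sregex_iterator it(text.cbegin(), text.cend(), rgx), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[1].first);           // copy text up to the list contents
        std::string inner = m[1].str();         // e.g. "a,b,c,d"
        std::replace(inner.begin(), inner.end(), ',', ' ');
        out += inner;
        last = m[1].second;                     // continue right after the list
    }
    out.append(last, text.cend());              // copy the remaining tail
    return out;
}
```

std::regex_replace has no callback overload, which is why the sketch walks the matches with std::sregex_iterator and rebuilds the string by hand.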
Example output:
"(x,y,z) elements-(a,b) (m,m,m) elements-(c,d,e,f,g,h)"
...becomes...
"(x,y,z) elements-(a b) (m,m,m) elements-(c d e f g h)"