QRegExp not finding expected string pattern - c++

I am working in Qt 5.2, and I have a piece of code that takes in a string and enters one of several if statements based on its format. One of the formats searched for is the letters "RCV", followed by a variable number of digits, a decimal point, and then one more digit. There can be more than one of these values in the line, separated by "|"; for example, it could be one value like "RCV0123456.1" or multiple values like "RCV12345.1|RCV678.9". Right now I am using the QRegExp class to find this, like this:
QString value = "RCV000030249.2|RCV000035360.2"; //Note: real test value from my code
if(QRegExp("^[RCV\d+\.\d\|?]+$").exactMatch(value))
std::cout << ":D" << std::endl;
else
std::cout << ":(" << std::endl;
I want it to use the if statement, but it keeps going into the else statement. Is there something I'm doing wrong with the regular expression?

Your expression should escape the backslashes, as @vahancho mentioned in a comment:
if(QRegExp("^[RCV\\d+\\.\\d\\|?]+$").exactMatch(value))
If you use C++11, then you can use its raw strings feature:
if(QRegExp(R"(^[RCV\d+\.\d\|?]+$)").exactMatch(value))

Aside from escaping the backslashes, which others have mentioned in answers and comments, consider this requirement from the question:
There can be more than one of these values in the line, separated by "|"; for example, it could be one value like "RCV0123456.1" or multiple values like "RCV12345.1|RCV678.9".
[RCV\d+\.\d\|?] may not be doing what you expect. Perhaps you want () instead of []:
/^
[RCV\d+\.\d\|?]+ # More than one of characters from the list:
# R, C, V, a digit, a +, a dot, a digit, a |, a ?
$/x
/^
(
RCV\d+\.\d # RCV, some digits, a dot, followed by a digit
\|? # Optional: a |
)+ # Quantifier of one or more
$/x
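Also, maybe you could revise the regex so that a | must be followed by another value, i.e. the group has to be matched *again*. Note that the (?1) subpattern call used here is PCRE-style syntax, so with QRegExp the group would have to be written out a second time: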
/^
(RCV\d+\.\d) # RCV, some digits, a dot, followed by a digit
(
\|(?1) # A |, then match subpattern 1 (Above)
)* # Quantifier of zero or more, so a single value still matches
$/x

To accept the line only if every occurrence is valid, and to require a | before each occurrence after the first (your pattern does not enforce the separator, even with the backslashes doubled):
QString value = "RCV000030249.2|RCV000035360.2"; //Note: real test value from my code
if(QRegExp("^RCV\\d+\\.\\d(\\|RCV\\d+\\.\\d)*$").exactMatch(value))
std::cout << ":D" << std::endl;
else
std::cout << ":(" << std::endl;

Related

regex_replace is returning empty string

I am trying to remove all characters that are not digit, dot (.), plus/minus sign (+/-) with empty character/string for float conversion.
When I pass my string through regex_replace function I am returned an empty string.
I believe something is wrong with my regex expression std::regex reg_exp("\\D|[^+-.]")
Code
#include <iostream>
#include <regex>
int main()
{
std::string temporary_recieve_data = " S S +456.789 tg\r\n";
std::string::size_type sz;
const std::regex reg_exp("\\D|[^+-.]"); // matches not digit, decimal point (.), plus sign, minus sign
std::string numeric_string = std::regex_replace(temporary_recieve_data, reg_exp, ""); //replace the character that are not digit, dot (.), plus-minus sign (+,-) with empty character/string for float conversion
std::cout << "Numeric String : " << numeric_string << std::endl;
if (numeric_string.empty())
{
return 0;
}
float data_value = std::stof(numeric_string, &sz);
std::cout << "Float Value : " << data_value << std::endl;
return 0;
}
I have been trying to evaluate my regex expression on regex101.com for past 2 days but I am unable to figure out where I am wrong with my regular expression. When I just put \D, the editor substitutes non-digit character properly but soon as I add or condition | for not dot . or plus + or minus - sign the editor returns empty string.
The string is empty because your regex matches each character.
\D already matches every character that is not a digit.
So plus, hyphen and the period thus far are consumed.
And digits get consumed by the negated class: [^+-.]
Further the hyphen indicates a range inside a character class.
Either escape it or put it at the start or end of the char-class.
(funnily the used range +-. 43-46 even contained a hyphen)
Remove the alternation with \D and put \d into the negated class:
[^\d.+-]+
See this demo at regex101 (attaching + for one or more is efficient)

How to retrieve the captured substrings from a capturing group that may repeat?

I'm sorry I found it difficult to express this question with my poor English. So, let's go directly to a simple example.
Assume we have a subject string "apple:banana:cherry:durian". We want to match the subject and have $1, $2, $3 and $4 become "apple", "banana", "cherry" and "durian", respectively. The pattern I'm using is ^(\w+)(?::(.*?))*$, and $1 will be "apple" as expected. However, $2 will be "durian" instead of "banana".
Because the subject string to match doesn't need to be 4 items, for example, it could be "one:two:three", and $1 and $2 will be "one" and "three" respectively. Again, the middle item is missing.
What is the correct pattern to use in this case? By the way, I'm going to use PCRE2 in C++ codes, so there is no split, a Perl built-in function. Thanks.
If the input contains strictly items of interest separated by :, like item1:item2:item3, as the attempt in the question indicates, then you can use the regex pattern
[^:]+
which matches consecutive characters which are not :, so a substring up to the first :. That may need to capture as well, ([^:]+), depending on the overall approach. How to use this to get all such matches depends on the language.†
In C++ there are different ways to approach this. Using std::regex_iterator
#include <string>
#include <vector>
#include <iterator>
#include <regex>
#include <iostream>
int main()
{
std::string str{R"(one:two:three)"};
std::regex r{R"([^:]+)"};
std::vector<std::string> result{};
auto it = std::sregex_iterator(str.begin(), str.end(), r);
auto end = std::sregex_iterator();
for(; it != end; ++it) {
auto match = *it;
result.push_back(match[0].str());
}
std::cout << "Input string: " << str << '\n';
for(auto i : result)
std::cout << i << '\n';
}
Prints as expected.
One can also use std::regex_search, even as it returns at first match -- by iterating over the string to move the search start after every match
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string str{"one:two:three"};
std::regex r{"[^:]+"};
std::smatch res;
std::string::const_iterator search_beg( str.cbegin() );
while ( regex_search( search_beg, str.cend(), res, r ) )
{
std::cout << res[0] << '\n';
search_beg = res.suffix().first;
}
std::cout << '\n';
}
(With this string and regex we don't need the raw string literal so I've removed them here.)
† This question was initially tagged with perl (with no c++), also with an explicit mention of it in text (still there), and the original version of this answer referred to Perl with
/([^:]+)/g
The /g "modifier" is for "global," to find all matches. The // are pattern delimiters.
When this expression is bound (=~) to a variable with a target string then the whole expression returns a list of matches when used in a context in which a list is expected, which can thus be directly assigned to an array variable.
my @captures = $string =~ /[^:]+/g;
(when this is used literally as shown then the capturing () aren't needed)
Assigning to an array provides this "list context." If the matching is used in a "scalar context," in which a single value is expected, like in the condition for an if test or being assigned to a scalar variable, then a single true/false is returned (usually 1 or '', empty string).
Repeating a capture group will only capture the value of the last iteration. Instead, you might make use of the \G anchor to get consecutive matches.
If the whole string can only contain word characters separated by colons:
(?:^(?=\w+(?::\w+)+$)|\G(?!^):)\K\w+
The pattern matches:
(?: Non capture group
^ Assert start of string
(?=\w+(?::\w+)+$) Assert from the current position 1+ word characters and 1+ repetitions of : and 1+ word characters till the end of the string
| Or
\G(?!^): Assert the position at the end of the previous match, not at the start and match :
) Close non capture group
\K\w+ Forget what is matched so far, and match 1+ word characters
Regex demo
To allow only words as well from the start of the string, and allow other chars after the word chars:
\G:?\K\w+
Regex demo

C++11 regex to tokenize Mathematical Expression

I have the following code to tokenize a string of the format: (1+2)/((8))-(100*34):
I'd like to throw an error to the user if they use an operator or character that isn't part of my regex.
e.g if user enters 3^4 or x-6
Is there a way to negate my regex, search for it and if it is true throw the error?
Can the regex expression be improved?
//Using c++11 regex to tokenize input string
//[0-9]+ = 1 or many digits
//Or [\\-\\+\\\\\(\\)\\/\\*] = "-" or "+" or "/" or "*" or "(" or ")"
std::regex e ( "[0-9]+|[\\-\\+\\\\\(\\)\\/\\*]");
std::sregex_iterator rend;
std::sregex_iterator a( infixExpression.begin(), infixExpression.end(), e );
queue<string> infixQueue;
while (a!=rend) {
infixQueue.push(a->str());
++a;
}
return infixQueue;
-Thanks
You can run a search on the string using the search expression [^0-9()+\-*/], written as the C++ string "[^0-9()+\\-*/]", which finds any character that is NOT a digit, a round bracket, a plus sign, a minus sign (strictly speaking, a hyphen), an asterisk or a slash.
A search with this regular expression should not find anything; if it does, the string contains an unsupported character like ^ or x.
[...] is a positive character class which means find a character being one of the characters in the square brackets.
[^...] is a negative character class which means find a character NOT being one of the characters in the square brackets.
The only characters which must be escaped within square brackets to be interpreted as literal characters are ], \ and -, whereby - need not be escaped if it is the first or last character in the list of characters within the square brackets. But it is nevertheless better to always escape - within square brackets, as this makes it obvious that the hyphen should be interpreted as a literal character and not with the meaning "FROM x TO z".
Of course this expression does not check for missing closing round brackets. But formula parsers do often not require that there is always a closing parenthesis for every opening parenthesis in comparison to a compiler or script interpreter simply because not needed to calculate the value based on entered formula.
Answer is given already but perhaps someone might need this
[0-9]?([0-9]*[.])?[0-9]+|[\\-\\+\\\\\(\\)\\/\\*]
This regex separates floats, integers and arithmetic operators
Here's the trick:
[0-9]?([0-9]*[.])?[0-9]+ -> if it's a digit and has a point, then grab the digits with the point and the digits that follow it; if not, just grab the digits.
Sorry if my answer isn't clear, I just learned regex and found this solution on my own by trial and error.
Here's the code (it takes a mathematical expression and splits all numbers and operators into a vector)
NOTE: I don't know if it accepts whitespace; the mathematical expressions that I worked with had none. Example: 4+2*(3+1) is separated nicely, but I haven't tried it with whitespace.
/* Separate every int or float or operator into a single string using regular expression and store it in untokenize vector */
string infix; //The string to be parse (the arithmetic operation if you will)
vector<string> untokenize;
std::regex words_regex("[0-9]?([0-9]*[.])?[0-9]+|[\\-\\+\\\\\(\\)\\/\\*]");
auto words_begin = std::sregex_iterator(infix.begin(), infix.end(), words_regex);
auto words_end = std::sregex_iterator();
for (std::sregex_iterator i = words_begin; i != words_end; ++i) {
cout << (*i).str() << endl;
untokenize.push_back((*i).str());
}
Output:
(
1
+
2
)
/
(
(
8
)
)
-
(
100
*
34
)

Why am I getting multiple regex matches?

I'm trying to write a processor for GLSL shader code that will allow me to analyze the code and dynamically determine what inputs and outputs I need to handle for each shader.
To accomplish that, I decided to use some regex to parse the shader code before I compile it via OpenGL.
I've written some test code to verify that the regex is working as I expect.
Code:
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main()
{
string strInput = " in vec3 i_vPosition; ";
smatch match;
// Will appear in regex as:
// \bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(\[[0-9]+\])?\s*;
regex rgx("\\bin\\s+[a-zA-Z0-9]+\\s+[a-zA-Z0-9_]+\\s*(\\[[0-9]+\\])?\\s*;");
bool bMatchFound = regex_search(strInput, match, rgx);
cout << "Match found: " << bMatchFound << endl;
for (int i = 0; i < match.size(); ++i)
{
cout << "match " << i << " (" << match[i] << ") ";
cout << "at position " << match.position(i) << std::endl;
}
}
The only problem is that the above code generates two results instead of one. Though one of the results is empty.
Output:
Match found: 1
match 0 (in vec3 i_vPosition;) at position 6
match 1 () at position 34
I ultimately want to generate multiple results when I provide a whole file as input, but I'd like to get some consistency so that I can process the results in a consistent manner.
Any ideas as to why I'm getting multiple results when I'm only expecting one?
Your regex contains a capture group
(\[[0-9]+\])?
which would contain square brackets surrounding 1 or more digits, but the ? makes it optional.
When applying the regex, the leading and trailing spaces are trimmed by the
\s+ ... \s*
The remainder of the string is matched by
[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*
And the optional capture group matches nothing, so it is reported as an empty string.
If you want to match strings that optionally contain that bit, but not return it as a capture, make the group non-capturing with ?: like:
\bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(?:\[[0-9]+\])?\s*;
I ultimately want to generate multiple results
The regex_search only finds the first match of the complete regular expression.
If you want to find the other places in your source text that the complete regular expression matches,
you must run regex_search repeatedly.
See
"C++ Regex to match words without punctuation"
for an example of repeatedly running the search.
the above code generates two results instead of one.
Confusing, isn't it?
The regular expression
\bin\s+[a-zA-Z0-9]+\s+[a-zA-Z0-9_]+\s*(\[[0-9]+\])?\s*;
includes round brackets().
The round brackets create a "group" aka "sub-expression".
Because the sub-expression is optional "(....)?",
the expression as a whole is allowed to match even if the sub-expression doesn't really match anything.
When the sub-expression doesn't match anything, the value of that sub-expression is an empty string.
See "Regular-expressions: Use Round Brackets for Grouping" for far more information on "capturing parenthesis" and "non-capturing parenthesis".
According to the documentation for regex_search,
match.size() is the number of subexpressions plus 1,
match[0] is the part of the source string that matches the complete regular expression.
match[1] is the part of the source string that matches the first sub-expression inside the regular expression.
match[n] is the part of the source string that matches the n'th sub-expression inside the regular expression.
A regular expression with only 1 sub-expression, as in the above example, will always return a match.size() of 2 -- one match for the complete regular expression, and one match for the sub-expression -- even when that sub-expression doesn't really match anything and is therefore the empty string.

How to validate a string to have only certain letters by perl and regex

I am looking for a perl regex which will validate a string containing only the letters ACGT. For example "AACGGGTTA" should be valid while "AAYYGGTTA" should be invalid, since the second string has "YY" which is not one of A,C,G,T letters. I have the following code, but it validates both the above strings
if($userinput =~/[A|C|G|T]/i)
{
$validEntry = 1;
print "Valid\n";
}
Thanks
Use a character class, and make sure you check the whole string by using the start of string token, \A, and end of string token, \z.
You should also use * or + to indicate how many characters you want to match -- * means "zero or more" and + means "one or more."
Thus, the regex below is saying "between the start and the end of the (case insensitive) string, there should be one or more of the following characters only: a, c, g, t"
if($userinput =~ /\A[acgt]+\z/i)
{
$validEntry = 1;
print "Valid\n";
}
Using the character-counting tr operator:
if( $userinput !~ tr/ACGT//c )
{
$validEntry = 1;
print "Valid\n";
}
tr/characterset// counts how many characters in the string are in characterset; with the /c flag, it counts how many are not in the characterset. Using !~ instead of =~ negates the result, so it will be true if there are no characters not in characterset or false if there are characters not in characterset.
Your character class [A|C|G|T] contains |. | does not stand for alternation in a character class, it only stands for itself. Therefore, the character class would include the | character, which is not what you want.
Your pattern is not anchored. The pattern /[ACGT]+/ would match any string that contains one or more of any of those characters. Instead, you need to anchor your pattern, so that only strings that contain just those characters from beginning to end are matched.
$ can also match just before a trailing newline. To avoid that, use \z to anchor at the end. \A anchors at the beginning (although it doesn't make a difference whether you use that or ^ in this case, using \A provides a nice symmetry).
So, your check should be written:
if ($userinput =~ /\A [ACGT]+ \z/ix)
{
$validEntry = 1;
print "Valid\n";
}