C++ Regex getting all matches on line

When reading a file line by line, I call this function on each line to look for function calls (names). I use it to match any run of the valid characters a-z, 0-9 and _ followed by '('. My problem is that I do not fully understand C++-style regex, or how to get it to look through the entire line for all possible matches. The regex is simple and straightforward but does not work as expected, though I'm learning this is the norm in C++.
void readCallbacks(const std::string lines)
{
std::string regxString = "[a-z0-9]+\(";
regex regx(regxString, std::regex_constants::icase);
smatch result;
if(regex_search(lines.begin(), lines.end(), result, regx, std::regex_constants::match_not_bol))
{
cout << result.str() << "\n";
}
}

You need to escape the backslash or use a raw string literal:
std::regex pattern("[a-z0-9]+\\(", std::regex_constants::icase);
//                           ^^
std::regex pattern(R"([a-z0-9]+\()", std::regex_constants::icase);
//                 ###^^^^^^^^^^^##
Also, your character range doesn't contain the desired underscore (_).
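To look through the entire line rather than stop at the first hit, one option is std::sregex_iterator, which repeats the search from the end of the previous match. The following is a minimal sketch, not the asker's final code: find_calls is a hypothetical helper name, and the underscore has been added to the character class.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect every "name(" occurrence found on one line.
std::vector<std::string> find_calls(const std::string& line) {
    // Raw string literal, so \( reaches the regex engine unchanged.
    // The underscore is included in the character class.
    static const std::regex call_re(R"([a-z0-9_]+\()", std::regex_constants::icase);

    std::vector<std::string> calls;
    // Each increment searches again, starting where the previous match ended.
    for (std::sregex_iterator it(line.begin(), line.end(), call_re), end; it != end; ++it) {
        calls.push_back(it->str());
    }
    return calls;
}
```

Calling find_calls("foo(1) + bar_2(baz(3))") should yield foo(, bar_2( and baz(. The same loop can print each match directly instead of collecting it.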

Related

How to find the exact substring with regex in c++11?

I am trying to find substrings that are not surrounded by other a-zA-Z0-9 symbols.
For example: I want to find substring hello, so it won't match hello1 or hellow but will match Hello and heLLo!##$%.
And I have such sample below.
std::string s = "1mySymbol1, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]*" + sub + "[^a-zA-Z0-9]*", std::regex::icase);
std::smatch match;
while (std::regex_search(s, match, rgx)) {
std::cout << match.size() << "match: " << match[0] << '\n';
s = match.suffix();
}
The result is:
1match: mySymbol
1match: , /_mySymbol_
1match: mysymbol
But I don't understand why the first occurrence, 1mySymbol1, also matches my regex.
How can I create a proper regex that will ignore such strings?
UPD
If I change it like this
std::string s = "mySymbol, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]+" + sub + "[^a-zA-Z0-9]+", std::regex::icase);
then I find only the substring in the middle
1match: , /_mySymbol_
and I don't find the substrings at the beginning and at the end.

The regex [^a-zA-Z0-9]* will match 0 or more characters, so it's perfectly valid for [^a-zA-Z0-9]*mysymbol[^a-zA-Z0-9]* to match mysymbol in 1mySymbol1 (allowing for case insensitivity). As you saw, this is fixed when you use [^a-zA-Z0-9]+ (matching 1 or more characters) instead.
With your update, you see that this doesn't match strings at the beginning or end. That's because [^a-zA-Z0-9]+ has to match 1 or more characters (which don't exist at the beginning or end of the string).
You have a few options:
Use beginning/end anchors: (?:[^a-zA-Z0-9]+|^)mysymbol(?:[^a-zA-Z0-9]+|$) (non-alphanumeric OR beginning of string, followed by mysymbol, followed by non-alphanumeric OR end of string).
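Use negative lookahead and negative lookbehind: (?<![a-zA-Z0-9])mysymbol(?![a-zA-Z0-9]) (match mysymbol which doesn't have an alphanumeric character before or after it). Note that with this form the match won't include the characters before/after mysymbol. Be aware that the default ECMAScript grammar of std::regex supports lookahead but not lookbehind, so this exact pattern works in engines such as PCRE but is rejected by std::regex.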
I recommend using https://regex101.com/ to play around with regular expressions. It lists all the different constructs you can use.
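Because std::regex's ECMAScript grammar has lookahead but no lookbehind, one way to combine the two options in C++11 is to use the anchor-or-separator prefix from the first option and a negative lookahead for the suffix. This is a sketch under assumptions: find_whole is a hypothetical helper, and sub is assumed to contain only alphanumeric characters (it is not escaped here).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return every occurrence of `sub` that is not
// surrounded by alphanumeric characters (case-insensitive).
// Assumption: `sub` contains no regex metacharacters.
std::vector<std::string> find_whole(const std::string& text, const std::string& sub) {
    // Prefix: start of input OR one non-alphanumeric character (consumed).
    // Suffix: negative lookahead, so the following character is not consumed
    // and stays available as the prefix of the next match.
    const std::regex rgx("(?:^|[^a-zA-Z0-9])(" + sub + ")(?![a-zA-Z0-9])",
                         std::regex::icase);

    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), rgx), end; it != end; ++it) {
        found.push_back((*it)[1].str());  // group 1 holds only the substring itself
    }
    return found;
}
```

With the input "mySymbol, /_mySymbol_ mysymbol 1mySymbol1" and sub "mysymbol", this should return the first three occurrences and skip 1mySymbol1. Reading capture group 1 instead of match[0] keeps the leading separator out of the result.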

c++11/regex - search for exact string, escape [duplicate]

This question already has answers here:
std::regex escape special characters for use in regex
(3 answers)
Closed 6 years ago.
Say you have a string which is provided by the user. It can contain any kind of character. Examples are:
std::string s1{"hello world"};
std::string s1{".*"};
std::string s1{"*{}97(}{.}}\\testing___just a --%#$%# literal%$#%^"};
...
Now I want to search in some text for occurrences of >> followed by the input string s1 followed by <<. For this, I have the following code:
std::string input; // the input text
std::regex regex{">> " + s1 + " <<"};
if (std::regex_match(input, regex)) {
// add logic here
}
This works fine if s1 did not contain any special characters. However, if s1 had some special characters, which are recognized by the regex engine, it doesn't work.
How can I escape s1 such that std::regex considers it as a literal, and therefore does not interpret s1? In other words, the regex should be:
std::regex regex{">> " + ESCAPE(s1) + " <<"};
Is there a function like ESCAPE() in std?
Important: I simplified my question. In my real case, the regex is much more complex. As I am only having trouble with the fact that s1 is interpreted, I left these details out.

You will have to escape all special characters in the string with \. The most straightforward approach would be to use another expression to sanitize the input string before creating the expression regex.
// matches any characters that need to be escaped in RegEx
// (the backslash itself is included, otherwise a "\t" in the input would become a tab escape)
std::regex specialChars { R"([-[\]{}()*+?.,\\\^$|#\s])" };
std::string input = ">> "+ s1 +" <<";
std::string sanitized = std::regex_replace( input, specialChars, R"(\$&)" );
// "sanitized" can now safely be used in another expression
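The same idea can be wrapped in a small function that plays the role of the ESCAPE() placeholder from the question, so that only the user-supplied part is sanitized. This is a sketch, not a standard facility: escape_regex is a hypothetical name, and the character class lists the ECMAScript metacharacters, including the backslash itself.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: backslash-escape every ECMAScript metacharacter in `s`
// so the result can be embedded in a larger pattern as a literal.
std::string escape_regex(const std::string& s) {
    static const std::regex special(R"([.^$|()\[\]{}*+?\\])");
    // In the replacement format, $& stands for the matched character,
    // so each metacharacter is replaced by a backslash plus itself.
    return std::regex_replace(s, special, R"(\$&)");
}
```

For example, std::regex(">> " + escape_regex(s1) + " <<") with s1 set to ".*" should then match the text ">> .* <<" literally and no longer match ">> ab <<".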

Regex_match doesn't match in a file

I want to recognize some lines in a text file using regex, but regex_match doesn't match any line, even if I use regex patron(".*")
string dirin = "/home/user/in.srt";
string dirout = "/home/user/out.srt";
ifstream in(dirin.c_str());
ofstream out(dirout.c_str());
string line;
// regex patron("(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})\\s-->\\s(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})");
regex patron(".*");
smatch m;
while (getline(in, line)) {
if (regex_match(line, m, patron)) {
out << "ok";
};
out << line;
}
in.close();
out.close();
The code always prints the string line to the out.srt file, but never the string "ok" from inside the if (regex_match(line, m, patron)).
I'm testing it with the following lines
1
00:01:00,708 --> 00:01:01,800
You look at that river
2
00:01:02,977 --> 00:01:04,706
gently flowing by.
3
00:01:06,213 --> 00:01:08,238
You notice the leaves

Note that when the file uses Windows (CRLF) line endings, getline() returns each line with a trailing carriage return (CR) character, and the ECMAScript . pattern does not match CR because it treats it as a line terminator.
regex_match requires that a whole string matches the pattern.
Thus, you need to account for an optional carriage return at the end of the pattern. You can do it by appending \r? or \s* at the end of the pattern:
regex patron("(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})\\s-->\\s(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})\\s*");
or
regex patron(".*\\s*");
Also, consider using raw string literals if your C++ version allows it:
regex patron(R"((\d{2}):(\d{2}):(\d{2}),(\d{3})\s-->\s(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*)");
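An alternative to widening the pattern is to strip the stray CR from each line before matching. The sketch below assumes the CR comes from CRLF line endings; strip_cr and is_timestamp_line are hypothetical helper names.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: drop one trailing '\r' left over from a CRLF line ending.
void strip_cr(std::string& line) {
    if (!line.empty() && line.back() == '\r') {
        line.pop_back();
    }
}

// Returns true if the whole line is an SRT timestamp range.
bool is_timestamp_line(std::string line) {
    strip_cr(line);
    static const std::regex patron(
        R"((\d{2}):(\d{2}):(\d{2}),(\d{3})\s-->\s(\d{2}):(\d{2}):(\d{2}),(\d{3}))");
    return std::regex_match(line, patron);
}
```

In the read loop, calling strip_cr(line) right after getline(in, line) also keeps the CR out of whatever is written to the output file.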

Tokenizing boost::regex matches

I have created a regex to match the lines of a file which have the following structure: string int int
int main()
{
std::string line;
boost::regex pat("\\w\\s\\d\\s\\d");
while (std::cin)
{
std::getline(std::cin, line);
boost::smatch matches;
if (boost::regex_match(line, matches, pat))
std::cout << matches[2] << std::endl;
}
}
I would like to save those values into a pair<string,pair<int,int>>. How can I tokenize the boost::regex match to achieve this?

First of all, your regular expression accepts "one word character, then one whitespace character, then one digit, then one whitespace character, then one digit", which is probably not what you want. Most likely you want the expression to look like:
\w+\s+\d+\s+\d+
where \w+ now means "one or more word characters". If you are sure that there is only one space between tokens you can omit the + after \s, but keeping it is safer.
Then you need to mark the parts of the expression you want to extract. In regex terms these are called sub-expressions:
(\w+)\s+(\d+)\s+(\d+)
This way, whatever (\w+) matches (one or more word characters) will be in matches[1], the first (\d+) in matches[2] and the second (\d+) in matches[3]. Of course, you need to double each \ when you put the pattern in a C++ string literal.
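Putting the sub-expressions to work might look like the sketch below. It uses std::regex rather than boost::regex so that it has no external dependency; the capture-group indexing is the same in both. Record and parse_line are hypothetical names, and std::stoi is assumed to be acceptable for the conversion (it throws if a number does not fit in an int).

```cpp
#include <regex>
#include <string>
#include <utility>

using Record = std::pair<std::string, std::pair<int, int>>;

// Hypothetical helper: parse "string int int" into `out`.
// Returns false if the line does not have that shape.
bool parse_line(const std::string& line, Record& out) {
    static const std::regex pat(R"((\w+)\s+(\d+)\s+(\d+))");
    std::smatch matches;
    if (!std::regex_match(line, matches, pat)) {
        return false;
    }
    // matches[0] is the whole line; groups 1..3 are the three tokens.
    out = Record(matches[1].str(),
                 std::make_pair(std::stoi(matches[2].str()),
                                std::stoi(matches[3].str())));
    return true;
}
```

For the line "alice 10 20" this should fill the record with "alice", 10 and 20, while a line such as "alice 10" is rejected.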

C++ regex escaping punctuation characters like "."

Matching a "." in a string with the std::tr1::regex class makes me use a weird workaround.
Why do I need to check for "\\\\." instead of "\\."?
regex(".") // Matches everything (but "\n") as expected.
regex("\\.") // Matches everything (but "\n").
regex("\\\\.") // Matches only ".".
Can someone explain to me why? It's really bothering me, since my code was originally written using boost::regex classes, which didn't need this syntax.
Edit: Sorry, regex("\\\\.") seems to match nothing.
Edit2: Some code
void parser::lex(regex& token)
{
// Skipping whitespaces
{
regex ws("\\s*");
sregex_token_iterator wit(source.begin() + pos, source.end(), ws, regex_constants::match_default), wend;
if(wit != wend)
pos += (*wit).length();
}
sregex_token_iterator it(source.begin() + pos, source.end(), token, regex_constants::match_default), end;
if (it != end)
temp = *it;
else
temp = "";
}

This is because, in a C++ string literal, the backslash starts an escape sequence, so the compiler tries to interpret \. as a single character before the regex engine ever sees it. What you want is for the regex to receive the two characters \., which is written "\\." in source code because \\ is the escape sequence for a single backslash character (\).
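The two layers of escaping can be checked directly. The sketch below uses std::regex instead of std::tr1::regex, and the names escaped, raw and is_single_dot are hypothetical; it shows that "\\." and the raw literal R"(\.)" are the same two-character pattern.

```cpp
#include <regex>
#include <string>

// Both literals contain exactly two characters: a backslash and a dot.
const std::string escaped = "\\.";
const std::string raw = R"(\.)";

// Hypothetical helper: true only if `s` is exactly one literal dot.
bool is_single_dot(const std::string& s) {
    static const std::regex dot(escaped);
    return std::regex_match(s, dot);
}
```

With this pattern, is_single_dot(".") should be true and is_single_dot("a") false, which is the behavior the question expected from regex("\\.").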

As it turns out, the actual problem was due to the way sregex_token_iterator was used. Using match_default meant it was always finding the next match in the string, if any, even if there is a non-match in-between. That is,
string source = "AAA.BBB";
regex dot("\\.");
sregex_token_iterator wit(source.begin(), source.end(), dot, regex_constants::match_default);
would give a match at the dot, rather than reporting that there was no match.
The solution is to use match_continuous instead.
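The effect of match_continuous can be seen with a plain std::regex_search call, used here in place of sregex_token_iterator to keep the sketch short. matches_at is a hypothetical helper, not part of the original parser.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: does `token` match starting exactly at offset `pos`?
// On success the matched text is stored in `out`.
bool matches_at(const std::string& source, std::size_t pos,
                const std::regex& token, std::string& out) {
    if (pos > source.size()) {
        return false;
    }
    std::smatch m;
    const auto first = source.begin() + static_cast<std::ptrdiff_t>(pos);
    // match_continuous: the match must begin at `first`; the engine may not
    // skip ahead to a later position.
    if (!std::regex_search(first, source.end(), m, token,
                           std::regex_constants::match_continuous)) {
        return false;
    }
    out = m.str();
    return true;
}
```

For the source "AAA.BBB" and the pattern "\\.", this should report no match at offset 0 and a match of "." at offset 3, which is the behavior a lexer needs.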

Try to escape the dot by its ASCII code:
regex("\\x2E")
