std regex_search to match only current line - c++

I use a various regexes to parse a C source file, line by line. First i read all the content of file in a string:
ifstream file_stream("commented.cpp",ifstream::binary);
std::string txt((std::istreambuf_iterator<char>(file_stream)),
std::istreambuf_iterator<char>());
Then i use a set of regex, which should be applied continusly until the match found, here i will give only one for example:
vector<regex> rules = { regex("^//[^\n]*$") };
char * search =(char*)txt.c_str();
int position = 0, length = 0;
for (int i = 0; i < rules.size(); i++) {
cmatch match;
if (regex_search(search + position, match, rules[i],regex_constants::match_not_bol | regex_constants::match_not_eol))
{
position += ( match.position() + match.length() );
}
}
But it don't work. It will match the comment not in the current line, but it will search whole string, for the first match, regex_constants::match_not_bol and regex_constants::match_not_eol should make the regex_search to recognize ^$ as start/end of line only, not end start/end of whole block. So here is my file:
commented.cpp:
#include <stdio.h>
//comment
The code should fail, my logic is with those options to regex_search, the match should fail, because it should search for pattern in the first line:
#include <stdio.h>
But instead it searches whole string, and immideatly finds //comment. I need help, to make regex_search match only in current line. The options match_not_bol and match_not_eol do not help me. Of course i can read a file line by line in a vector, and then do match of all rules on each string in vector, but it is very slow, i have done that, and it take too long time to parse a big file like that, that's why i want to let regex deal with new lines, and use positioning counter.

If it is not what you want please comment so I will delete the answer
What you are doing is not a correct way of using a regex library.
Thus here is my suggestion for anyone that wants to use std::regex library.
It only supports ECMAScript that somehow is a little
poor than all modern regex library.
It has bugs as many as you like ( just I found ):
the same regex but different results on Linux and Windows only C++
std::regex and ignoring flags
std::regex_match and lazy quantifier with strange behavior
In some cases (I test specifically with std::match_results ) It is 200 times slower in comparison to std.regex in d language
It has very confusing flag-match and almost it does not work (at least for me)
conclusion: do not use it at all.
But if anyone still demands to use c++ anyway then you can:
use boost::regex about Boost library because:
It is PCRE support
It has less bug ( I have not seen any )
It is smaller in bin file ( I mean executable file after compiling )
It is faster then std::regex
use gcc version 7.1.0 and NOT below. The last bug I found is in version 6.3.0
use clang version 3 or above
If you have enticed (= persuade) to NOT use c++ then you can use:
Use d regular expression link library for large task: std.regex and why:
Fast Faster Command Line Tools in D
Easy
Flexible drn
Use native pcre or pcre2 link that have been written in c
Extremely fast but a little complicated
Use perl for a simple task and specially Perl one-liner link

#include <stdio.h>
//comment
The code should fail, my logic is with those options to regex_search, the match should fail, because it should search for pattern in the first line:
#include <stdio.h>
But instead it searches whole string, and immideatly finds //comment. I need help, to make regex_search match only in current line.
Are you trying to match all // comments in a source code file, or only the first line?
The former can be done like this:
#include <iostream>
#include <fstream>
#include <regex>
int main()
{
auto input = std::ifstream{"stream_union.h"};
for(auto line = std::string{}; getline(input, line); )
{
auto submatch = std::smatch{};
auto pattern = std::regex(R"(//)");
std::regex_search(line, submatch, pattern);
auto match = submatch.str(0);
if(match.empty()) continue;
std::cout << line << std::endl;
}
std::cout << std::endl;
return EXIT_SUCCESS;
}
And the later can be done like this:
#include <iostream>
#include <fstream>
#include <regex>
int main()
{
auto input = std::ifstream{"stream_union.h"};
auto line = std::string{};
getline(input, line);
auto submatch = std::smatch{};
auto pattern = std::regex(R"(//)");
std::regex_search(line, submatch, pattern);
auto match = submatch.str(0);
if(match.empty()) { return EXIT_FAILURE; }
std::cout << line << std::endl;
return EXIT_SUCCESS;
}
If for any reason you're trying to get the position of the match, tellg() will do that for you.

Related

How can I use the C++ regex library to find a match and *then* replace it?

I am writing what amounts to a tiny DSL in which each script is read from a single string, like this:
"func1;func2;func1;4*func3;func1"
I need to expand the loops, so that the expanded script is:
"func1;func2;func1;func3;func3;func3;func3;func1"
I have used the C++ standard regex library with the following regex to find those loops:
regex REGEX_SIMPLE_LOOP(":?[0-9]+)\\*([_a-zA-Z][_a-zA-Z0-9]*;");
smatch match;
bool found = std::regex_search(*this, match, std::regex(REGEX_SIMPLE_LOOP));
Now, it's not too difficult to read out the loop multiplier and print the function N times, but how do I then replace the original match with this string? I want to do this:
if (found) match[0].replace(new_string);
But I don't see that the library can do this.
My backup place is to regex_search, then construct the new string, and then use regex_replace, but it seems clunky and inefficient and not nice to essentially do two full searches like that. Is there a cleaner way?
You can also NOT use regex, the parsing isn't too difficult.
So regex might be overkill. Demo here : https://onlinegdb.com/RXLqLtrUQ-
(and yes my output gives an extra ; at the end)
#include <string>
#include <sstream>
#include <iostream>
int main()
{
std::istringstream is{ "func1;func2;func1;4*func3;func1" };
std::string split;
// use getline to split
while (std::getline(is, split, ';'))
{
// assume 1 repeat
std::size_t count = 1;
// if split part starts with a digit
if (std::isdigit(split.front()))
{
// look for a *
auto pos = split.find('*');
// the first part of the string contains the repeat count
auto count_str = split.substr(0, pos);
// convert that to a value
count = std::stoi(count_str);
// and keep the rest ("funcn")
split = split.substr(pos + 1, split.size() - pos - 1);
}
// now use the repeat count to build the output string
for (std::size_t n = 0; n < count; ++n)
{
std::cout << split << ";";
}
}
// TODO invalid input string handling.
return 0;
}

detect new line using C++ boost regex_match [duplicate]

I just started using Boost::regex today and am quite a novice in Regular Expressions too. I have been using "The Regulator" and Expresso to test my regex and seem satisfied with what I see there, but transferring that regex to boost, does not seem to do what I want it to do. Any pointers to help me a solution would be most welcome. As a side question are there any tools that would help me test my regex against boost.regex?
using namespace boost;
using namespace std;
vector<string> tokenizer::to_vector_int(const string s)
{
regex re("\\d*");
vector<string> vs;
cmatch matches;
if( regex_match(s.c_str(), matches, re) ) {
MessageBox(NULL, L"Hmmm", L"", MB_OK); // it never gets here
for( unsigned int i = 1 ; i < matches.size() ; ++i ) {
string match(matches[i].first, matches[i].second);
vs.push_back(match);
}
}
return vs;
}
void _uttokenizer::test_to_vector_int()
{
vector<string> __vi = tokenizer::to_vector_int("0<br/>1");
for( int i = 0 ; i < __vi.size() ; ++i ) INFO(__vi[i]);
CPPUNIT_ASSERT_EQUAL(2, (int)__vi.size());//always fails
}
Update (Thanks to Dav for helping me clarify my question):
I was hoping to get a vector with 2 strings in them => "0" and "1". I instead never get a successful regex_match() (regex_match() always returns false) so the vector is always empty.
Thanks '1800 INFORMATION' for your suggestions. The to_vector_int() method now looks like this, but it goes into a never ending loop (I took the code you gave and modified it to make it compilable) and find "0","","","" and so on. It never find the "1".
vector<string> tokenizer::to_vector_int(const string s)
{
regex re("(\\d*)");
vector<string> vs;
cmatch matches;
char * loc = const_cast<char *>(s.c_str());
while( regex_search(loc, matches, re) ) {
vs.push_back(string(matches[0].first, matches[0].second));
loc = const_cast<char *>(matches.suffix().str().c_str());
}
return vs;
}
In all honesty I don't think I have still understood the basics of searching for a pattern and getting the matches. Are there any tutorials with examples that explains this?
The basic problem is that you are using regex_match when you should be using regex_search:
The algorithms regex_search and
regex_match make use of match_results
to report what matched; the difference
between these algorithms is that
regex_match will only find matches
that consume all of the input text,
where as regex_search will search for
a match anywhere within the text being
matched.
From the boost documentation. Change it to use regex_search and it will work.
Also, it looks like you are not capturing the matches. Try changing the regex to this:
regex re("(\\d*)");
Or, maybe you need to be calling regex_search repeatedly:
char *where = s.c_str();
while (regex_search(s.c_str(), matches, re))
{
where = m.suffix().first;
}
This is since you only have one capture in your regex.
Alternatively, change your regex, if you know the basic structure of the data:
regex re("(\\d+).*?(\\d+)");
This would match two numbers within the search string.
Note that the regular expression \d* will match zero or more digits - this includes the empty string "" since this is exactly zero digits. I would change the expression to \d+ which will match 1 or more.

C++: Regex: returns full string and not matched group

for those asking, the {0} allows selection of any one block within the sResult string separated by the | 0 is the first block
it needs to be dynamic for future expansion as that number will be configurable by users
So I am working on a regex to extract 1 portion of a string, however while it matches the results return are not what is expected.
std::string sResult = "MATCH_ME|BUT|NOT|ANYTHNG|ELSE";
std::regex pattern("^(?:[^|]+[|]){0}([^|;]+)");
std::smatch regMatch;
std::regex_search(sResult, regMatch, pattern);
if(regMatch[1].matched)
{
for( int i = 0; i < regMatch.size(); i++)
{
//SUBMATCH 0 = "MATCH_ME|BUT|NOT|ANYTHNG|ELSE"
//SUBMATCH 1 = "BUT|NOT|ANYTHNG|ELSE"
std::ssub_match sm = regMatch[i];
bValid = strcmp(regMatch[i].str().c_str(), pzPoint->_ptrTarget->_pzTag->szOPCItem);
}
}
For some reason I cannot figure out the code to get me just the MATCH_ME back so I can compare it to expected results list on the C++ side.
Anyone have any ideas on where I went wrong here.
It seems you're using regular expressions for what they haven't been designed for. You should first split your string at the delimiter | and apply regular expressions on the resulting tokens if you want to check them for validity.
By the way: The std::regex implementation in libstdc++ seems to be buggy. I just did some tests and found that even simple patterns containing escaped pipe characters like \\| failed to compile throwing a std::regex_error with no further information in the error message (GCC 4.8.1).
The following code example shows how to do what you are after - you compile this, then call it with a single numerical argument to extract that element of the input:
#include <iostream>
#include <cstring>
#include <regex>
int main(int argc, char *argv[]) {
char pat[100];
if (argc > 1) {
sprintf(pat, "^(?:[^|]+[|]){%s}([^|;]+)", argv[1]);
std::string sResult = "MATCH_ME|BUT|NOT|ANYTHNG|ELSE";
std::regex pattern(pat);
std::smatch regMatch;
std::regex_search(sResult, regMatch, pattern);
if(regMatch[1].matched)
{
std::ssub_match sm = regMatch[1];
std::cout << "The match is " << sm << std::endl;
//bValid = strcmp(regMatch[i].str().c_str(), pzPoint->_ptrTarget->_pzTag->szOPCItem);
}
}
return 0;
}
Creating an executable called match, you can then do
>> match 2
The match is NOT
which is what you wanted.
The regex, it turns out, works just fine - although as a matter of preference I would use \| instead of [|] for the first part.
Turns out the problem was on the C side in extracting the match, it had to be done more directly, below is the code that gets me exactly what I wanted out of the string so I can use it later.
std::string sResult = "MATCH_ME|BUT|NOT|ANYTHNG|ELSE";
std::regex pattern("^(?:[^|]+[|]){0}([^|;]+)");
std::smatch regMatch;
std::regex_search(sResult, regMatch, pattern);
if(regMatch[1].matched)
{
std::string theMatchedPortion = regMatch[1];
//the issue was not with the regex but in how I was retrieving the results.
//theMatchedPortion now equals "MATCH_ME" and by changing the number associated
with it I can navigate through the string
}

Boost regex not matching "\\s" to spaces

I'm just starting out with boost and c++ and I'm struggling to understand the behaviour of boost's regular expression engine when it comes to matching whitespace. If I use the code:
boost::regex rx(" ");
cout << regex_search(" ", rx);
to match spaces, then everything works as expected and the regex_search returns true. However, if I try replacing the regular expression with "\s" to match all whitespace characters, I never get a match, and the following code always outputs false:
boost::regex rx("\\s");
cout << regex_search(" ", rx);
What am I missing here?
As requested, here's my complete test case:
#include <boost/regex.hpp>
#include <iostream>
using namespace std;
int main()
{
boost::regex rx("\\s", boost::regex::icase);
cout << regex_search(" ", rx);
}
Got it -- I was originally using the prebuilt libraries from ascend4.org/Binary_installer_for_Boost_on_MinGW . After building Boost 1.52 the code works as expected. Trying to shortcut the boost build process ended up costing me a couple of hours of frustration... I've learnt my lesson now!

Getting sub-match_results with boost::regex

Hey, let's say I have this regex: (test[0-9])+
And that I match it against: test1test2test3test0
const bool ret = boost::regex_search(input, what, r);
for (size_t i = 0; i < what.size(); ++i)
cout << i << ':' << string(what[i]) << "\n";
Now, what[1] will be test0 (the last occurrence). Let's say that I need to get test1, 2 and 3 as well: what should I do?
Note: the real regex is extremely more complex and has to remain one overall match, so changing the example regex to (test[0-9]) won't work.
I think Dot Net has the ability to make single capture group Collections so that (grp)+ will create a collection object on group1. The boost engine's regex_search() is going to be just like any ordinary match function. You sit in a while() loop matching the pattern where the last match left off. The form you used does not use a bid-itterator, so the function won't start the next match where the last match left off.
You can use the itterator form:
(Edit - you can also use the token iterator, defining what groups to iterate over. Added in the code below).
#include <boost/regex.hpp>
#include <string>
#include <iostream>
using namespace std;
using namespace boost;
int main()
{
string input = "test1 ,, test2,, test3,, test0,,";
boost::regex r("(test[0-9])(?:$|[ ,]+)");
boost::smatch what;
std::string::const_iterator start = input.begin();
std::string::const_iterator end = input.end();
while (boost::regex_search(start, end, what, r))
{
string stest(what[1].first, what[1].second);
cout << stest << endl;
// Update the beginning of the range to the character
// following the whole match
start = what[0].second;
}
// Alternate method using token iterator
const int subs[] = {1}; // we just want to see group 1
boost::sregex_token_iterator i(input.begin(), input.end(), r, subs);
boost::sregex_token_iterator j;
while(i != j)
{
cout << *i++ << endl;
}
return 0;
}
Output:
test1
test2
test3
test0
Boost.Regex offers experimental support for exactly this feature (called repeated captures); however, since it's huge performance hit, this feature is disabled by default.
To enable repeated captures, you need to rebuild Boost.Regex and define macro BOOST_REGEX_MATCH_EXTRA in all translation units; the best way to do this is to uncomment this define in boost/regex/user.hpp (see the reference, it's at the very bottom of the page).
Once compiled with this define, you can use this feature by calling/using regex_search, regex_match and regex_iterator with match_extra flag.
Check reference to Boost.Regex for more info.
Seems to me like you need to create a regex_iterator, using the (test[0-9]) regex as input. Then you can use the resulting regex_iterator to enumerate the matching substrings of your original target.
If you still need "one overall match" then perhaps that work has to be decoupled from the task of finding matching substrings. Can you clarify that part of your requirement?