How Can I Use a Regex on the Reverse of a string? - c++

I want to use a regex on the reverse of a string.
I can do the following but all my sub_matches are reversed:
string foo("lorem ipsum");
match_results<string::reverse_iterator> sm;
if (regex_match(foo.rbegin(), foo.rend(), sm, regex("(\\w+)\\s+(\\w+)"))) {
cout << sm[1] << ' ' << sm[2] << endl;
}
else {
cout << "bad\n";
}
What I want is to get out:
ipsum lorem
Is there any provision for getting the sub-matches that are not reversed? That is, any provision beyond reversing the strings after they're matched like this:
string first(sm[1]);
string second(sm[2]);
reverse(first.begin(), first.end());
reverse(second.begin(), second.end());
cout << first << ' ' << second << endl;
EDIT:
It has been suggested that I update the question to clarify what I want:
Running the regex backwards on the string is not about reversing the order in which the matches are found. The regex is far more complex than would be valuable to post here, but running it backwards saves me from needing a lookahead. This question is about the handling of sub-matches obtained from a match_results<string::reverse_iterator>. I need to be able to get them out as they were in the input, here foo. I don't want to have to construct a temporary string and run reverse on it for each sub-match. How can I avoid doing this?

You could just reverse the order in which you use the results:
#include <regex>
#include <string>
#include <iostream>
using namespace std;
int main()
{
string foo("lorem ipsum");
smatch sm;
if (regex_match(foo, sm, regex("(\\w+)\\s+(\\w+)"))) {
cout << sm[2] << ' ' << sm[1] << endl; // use second as first
}
else {
cout << "bad\n";
}
}
Output:
ipsum lorem

This is absolutely possible! The key is in the fact that a sub_match inherits from pair<BidirIt, BidirIt>. Since sub_matches will be obtained from: match_results<string::reverse_iterator> sm, the elements of the pair a sub_match inherits from will be string::reverse_iterators.
So for any given sub_match from sm you can get the forward range from its second.base() to its first.base(). You don't have to construct strings to stream ranges, but you will need to construct an ostream_iterator:
ostream_iterator<char> output(cout);
copy(sm[1].second.base(), sm[1].first.base(), output);
output = ' ';
copy(sm[2].second.base(), sm[2].first.base(), output);
Take heart though, there is a better solution on the horizon! This answer discusses string_literals; as of right now no action has been taken on them, but they have made it into the "Evolution Subgroup".
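As a minimal sketch of the base() approach described above: the names forward_text, match_reversed and reverse_match_example are hypothetical and not part of the question. The helper still builds one std::string per sub-match, but it copies the characters in input order, so no call to reverse is needed.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: return the text of a sub_match taken over reverse
// iterators in the order it appears in the original (forward) string.
template <class ReverseIt>
std::string forward_text(const std::sub_match<ReverseIt>& m) {
    if (!m.matched) return std::string();
    // second.base() is the forward begin, first.base() is the forward end.
    return std::string(m.second.base(), m.first.base());
}

// Hypothetical wrapper: run the question's regex back to front over foo.
std::pair<std::string, std::string> match_reversed(std::string foo) {
    const std::regex re("(\\w+)\\s+(\\w+)");
    std::match_results<std::string::reverse_iterator> sm;
    if (!std::regex_match(foo.rbegin(), foo.rend(), sm, re))
        return {};
    return {forward_text(sm[1]), forward_text(sm[2])};
}

// Usage sketch: group 1 matched the end of the input, group 2 the start.
void reverse_match_example() {
    const auto words = match_reversed("lorem ipsum");
    assert(words.first == "ipsum");
    assert(words.second == "lorem");
}
```

If even that single copy is unwanted, the same two base() iterators can be passed straight to an algorithm such as copy, as in the answer above.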

Related

Is it possible to find two strings in one string using regular expressions? [duplicate]

I'm a bit confused about the following C++11 code:
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::string haystack("abcdefabcghiabc");
std::regex needle("abc");
std::smatch matches;
std::regex_search(haystack, matches, needle);
std::cout << matches.size() << std::endl;
}
I'd expect it to print out 3 but instead I get 1. Am I missing something?
You get 1 because regex_search returns only 1 match, and size() will return the number of capture groups + the whole match value.
Your matches is...:
Object of a match_results type (such as cmatch or smatch) that is filled by this function with information about the match results and any submatches found.
If [the regex search is] successful, it is not empty and contains a series of sub_match objects: the first sub_match element corresponds to the entire match, and, if the regex expression contained sub-expressions to be matched (i.e., parentheses-delimited groups), their corresponding sub-matches are stored as successive sub_match elements in the match_results object.
Here is code that will find multiple matches:
#include <string>
#include <iostream>
#include <regex>
using namespace std;
int main() {
string str("abcdefabcghiabc");
int i = 0;
regex rgx1("abc");
smatch smtch;
while (regex_search(str, smtch, rgx1)) {
std::cout << i << ": " << smtch[0] << std::endl;
i += 1;
str = smtch.suffix().str();
}
return 0;
}
See IDEONE demo returning abc 3 times.
As this method destroys the input string, here is another alternative based on the std::sregex_iterator (std::wsregex_iterator should be used when your subject is an std::wstring object):
int main() {
std::regex r("ab(c)");
std::string s = "abcdefabcghiabc";
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
std::cout << "Match value: " << m.str() << " at Position " << m.position() << '\n';
std::cout << " Capture: " << m[1].str() << " at Position " << m.position(1) << '\n';
}
return 0;
}
See IDEONE demo, returning
Match value: abc at Position 0
Capture: c at Position 2
Match value: abc at Position 6
Capture: c at Position 8
Match value: abc at Position 12
Capture: c at Position 14
What you're missing is that matches is populated with one entry for each capture group (including the entire matched substring as the 0th capture).
If you write
std::regex needle("a(b)c");
then you'll get matches.size()==2, with matches[0]=="abc", and matches[1]=="b".
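The difference between "entries in one match" and "number of matches" can be sketched as follows. The function names count_matches, entries_per_match and size_vs_count_example are hypothetical, and counting with std::sregex_iterator is just one possible way to do it.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Number of non-overlapping matches of pattern in haystack.
std::size_t count_matches(const std::string& haystack, const std::string& pattern) {
    const std::regex needle(pattern);
    const std::sregex_iterator first(haystack.begin(), haystack.end(), needle);
    const std::sregex_iterator last;  // default-constructed: end of sequence
    return static_cast<std::size_t>(std::distance(first, last));
}

// Value of matches.size() after a single regex_search:
// the whole match plus one entry per capture group (0 if nothing matched).
std::size_t entries_per_match(const std::string& haystack, const std::string& pattern) {
    const std::regex needle(pattern);
    std::smatch matches;
    std::regex_search(haystack, matches, needle);
    return matches.size();
}

// Usage sketch with the haystack from the question.
void size_vs_count_example() {
    const std::string haystack("abcdefabcghiabc");
    assert(count_matches(haystack, "abc") == 3);
    assert(entries_per_match(haystack, "abc") == 1);
    assert(entries_per_match(haystack, "a(b)c") == 2);
}
```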
EDIT: Some people have downvoted this answer. That may be for a variety of reasons, but if it is because it does not apply to the answer I criticized (no one left a comment to explain the decision), they should take note that W. Stribizew changed the code two months after I wrote this, and I was unaware of it until today, 2021-01-18. The rest of the answer is unchanged from when I first wrote it.
@stribizhev's solution has quadratic worst-case complexity for sane regular expressions. For insane ones (e.g. "y*"), it doesn't terminate. In some applications, these issues could be DoS attacks waiting to happen. Here's a fixed version:
string str("abcdefabcghiabc");
int i = 0;
regex rgx1("abc");
smatch smtch;
auto beg = str.cbegin();
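while (regex_search(beg, str.cend(), smtch, rgx1)) {
    std::cout << i << ": " << smtch[0] << std::endl;
    i += 1;
    // Resume right after the match, wherever it started; advancing by the
    // match length alone would ignore any unmatched text before the match.
    beg = smtch[0].second;
    if (smtch.length(0) == 0) {
        // Empty match: step one character forward to guarantee progress.
        if (beg == str.cend())
            break;
        ++beg;
    }
}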
According to my personal preference, this will find n+1 matches of an empty regex in a string of length n. You might also just exit the loop after an empty match.
If you want to compare the performance for a string with millions of matches, add the following lines after the definition of str (and don't forget to turn on optimizations), once for each version:
for (int j = 0; j < 20; ++j)
str = str + str;

regex hanging my program

I'm trying to write a c++ regex to essentially match a few symbols and identifiers as part of a tokenizer. Currently, I have this:
EDITED
regex tokens("([a-zA-Z_][a-zA-Z0-9_]*)|(\\S?)|(\\S)")
vector<string> identifiers(std::sregex_token_iterator(str.begin(), str.end(),
IDENTIFIER),std::sregex_token_iterator());
https://regex101.com/r/mFTC1Y/2
The problem is, it hangs my program (it just takes forever and I never get to the matches). I don't understand how that can be. The regex tester I'm using says it takes about 7 ms to match...
Please help!
JUST EDITED: so this regex matches what I want, but only via group captures. If it parses:
main()
It will return
main( // full match
main // group 1
( // group 2
new match
) // full match
) // group 3
I just want the group matches without having to explicitly check the respective groups (i.e. I don't want the full match returned to me). How can I update my code to do that?
EDIT
So, this is the full, working code. I'd prefer it be more elegant.
regex TOKENS("([a-zA-Z_][a-zA-Z0-9_]*)|(\\S?)|(\\S)");
auto identifier = sregex_iterator(str.cbegin(), str.cend(), TOKENS);
auto it = sregex_iterator();
for_each(identifier, it, [&](smatch const& m){
string group1(m[1].str());
string group2(m[2].str());
string group3(m[3].str());
if(isKeyword(keywords, group1)) cout << "<keyword> " << group1 << " </keyword>" << endl;
else if(group1 != "") cout << "<identifier> " << group1 << " </identifier>" << endl;
if (isSymbol(symbols, group2)) cout << "<symbol> " << group2 << " </symbol>" << endl;
if (isSymbol(symbols, group3)) cout << "<symbol> " << group3 << " </symbol>" << endl;
});
Something more elegant would probably come at the cost of a very complex regex, or else a very clever one, since essentially what I'm trying to do is tokenize code into one of three types: KEYWORD, ID and SYMBOL - all with one regex. Next I'll have to tackle INT/STRING const and comments. What I'm trying to avoid is tokenizing char by char, because then I'll have even more control-flow statements (which I don't want).
I am not sure if your regex is correct.
Try the below:
#include <iostream>
#include <string>
#include <algorithm>
#include <iterator>
#include <vector>
#include <regex>
// Our test data as a raw string, so it also contains \n and so on
std::string testData(
R"#( :-) IDcorrect1 _wrongID I2DCorrect
3FALSE lowercasecorrect Underscore_not_allowed
i3DCorrect,i4 :-)
}
)#");
std::regex re("(\\b[a-zA-Z][a-zA-Z0-9]*\\b)");
int main(void)
{
// Define the variable id as vector of string and use the range constructor to read the test data and tokenize it
std::vector<std::string> id{ std::sregex_token_iterator(testData.begin(), testData.end(), re, 1), std::sregex_token_iterator() };
// For debug output. Print complete vector to std::cout
std::copy(id.begin(), id.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
return 0;
}
All IDs will be in the vector. Then you can further check.
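For the follow-up about classifying tokens without inspecting the full match, here is a rough sketch using std::sregex_iterator and the matched flag of each group. The names Kind, tokenize and tokenize_example are hypothetical, keyword lookup is left out, and the regex is a simplified two-alternative variant of the one in the question (an assumption, not the asker's final pattern).

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

enum class Kind { Identifier, Symbol };

// Split src into identifiers and single-character symbols, skipping whitespace.
std::vector<std::pair<Kind, std::string>> tokenize(const std::string& src) {
    // Group 1: identifier. Group 2: any other single non-space character.
    static const std::regex re("([a-zA-Z_][a-zA-Z0-9_]*)|(\\S)");
    std::vector<std::pair<Kind, std::string>> out;
    for (std::sregex_iterator it(src.begin(), src.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        if (m[1].matched)
            out.emplace_back(Kind::Identifier, m[1].str());
        else
            out.emplace_back(Kind::Symbol, m[2].str());
    }
    return out;
}

// Usage sketch with the input from the question.
void tokenize_example() {
    const auto tokens = tokenize("main()");
    assert(tokens.size() == 3);
    assert(tokens[0].first == Kind::Identifier && tokens[0].second == "main");
    assert(tokens[1].first == Kind::Symbol && tokens[1].second == "(");
    assert(tokens[2].first == Kind::Symbol && tokens[2].second == ")");
}
```

Because exactly one alternative takes part in each match, checking which group matched is enough to decide the token type, and the full match never needs to be read.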

removing multiple spaces in c++ from string

I have the following code to open a file and read the data from it, then take the relevant part and print it to the screen.
const char* search = "model name";
int Offset;
string Cpu;
ifstream CpuInfo;
CpuInfo.open ("/proc/cpuinfo");
if(CpuInfo.is_open())
{
while(!CpuInfo.eof())
{
getline(CpuInfo,Cpu);
if ((Offset = Cpu.find(search, 0)) != string::npos)
{
//cout << "found '" << search << " " << line << endl;
break;
}
}
CpuInfo.close();
}
Cpu.replace (0,13,"");
cout << Cpu;
This usually outputs the type of CPU you're using, but one problem is that some people have a varying number of spaces in between the words that it prints out.
My question is how to remove all the spaces from in between the words. They can be of a random amount and aren't always present.
Thank you in advance.
Since your question states: "how to remove all the spaces from inbetween the words":
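You can use std::remove_if from the standard <algorithm> library together with std::isspace from <cctype>. Note that std::remove_if only shifts the kept characters to the front and returns the new logical end, so the leftover tail has to be erased as well:
std::string mystring = "Text with some spaces";
mystring.erase(std::remove_if(mystring.begin(), mystring.end(),
    [](unsigned char c) { return std::isspace(c) != 0; }),
    mystring.end());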
This now becomes:
Textwithsomespaces
REFERENCES:
http://en.cppreference.com/w/cpp/algorithm/remove
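If the goal is to keep the words separated and only squeeze each run of whitespace down to one space (which may be closer to what the CPU model string needs), a plain loop is enough. This is a sketch under that assumption; collapse_spaces and collapse_spaces_example are hypothetical names, and the sample CPU string is made up for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Replace every run of whitespace characters with a single ' '.
std::string collapse_spaces(const std::string& in) {
    std::string out;
    bool prev_space = false;
    for (unsigned char c : in) {
        const bool is_space = std::isspace(c) != 0;
        if (is_space && prev_space)
            continue;  // drop the 2nd, 3rd, ... whitespace character of a run
        out.push_back(is_space ? ' ' : static_cast<char>(c));
        prev_space = is_space;
    }
    return out;
}

// Usage sketch on an illustrative model-name string.
void collapse_spaces_example() {
    assert(collapse_spaces("Intel(R)  Core(TM)   i7  CPU") == "Intel(R) Core(TM) i7 CPU");
}
```

Leading and trailing whitespace is reduced to one space as well, not trimmed; trimming would be a separate step.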

How to use wildcard for strings (matching and replacing)?

I want to search a string in C++ for a pattern of letters in which each ? stands for any single letter.
Think of a word like abcdefgh. I want an algorithm that, given the input ?c, where ? can be any letter, finds bc, and that, given ?e?, finds def.
Do you have any ideas?
How about using boost::regex, or std::regex if you're using a C++11-enabled compiler?
If you just want to support ?, that's pretty easy: when you encounter a ? in the pattern, just skip ahead over one byte of input (or check for isalpha, if you really meant you only want to match letters).
Edit: Assuming the more complex problem (finding a match starting at any position in the input string), you could use code something like this:
#include <string>
size_t match(std::string const &pat, std::string const &target) {
if (pat.size() > target.size())
return std::string::npos;
size_t max = target.size()-pat.size()+1;
for (size_t start =0; start < max; ++start) {
size_t pos;
for (pos=0; pos < pat.size(); ++pos)
if (pat[pos] != '?' && pat[pos] != target[start+pos])
break;
if (pos == pat.size())
return start;
}
return std::string::npos;
}
#ifdef TEST
#include <iostream>
int main() {
std::cout << match("??cd?", "aaaacdxyz") << "\n";
std::cout << match("?bc", "abc") << "\n";
std::cout << match("ab?", "abc") << "\n";
std::cout << match("ab?", "xabc") << "\n";
std::cout << match("?cd?", "cdx") << "\n";
std::cout << match("??cd?", "aaaacd") << "\n";
std::cout << match("??????", "abc") << "\n";
return 0;
}
#endif
If you only want to signal a yes/no based on whether the whole pattern matches the whole input, you do pretty much the same thing, but with the initial test for != instead of >, and then basically remove the outer loop.
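The yes/no variant just described might look like the sketch below. The names match_whole and match_whole_example are hypothetical; the comparison logic is the same as in match above, with the size test changed to != and the outer loop removed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if pat matches the whole of target, where '?' in pat matches any one character.
bool match_whole(const std::string& pat, const std::string& target) {
    if (pat.size() != target.size())
        return false;
    for (std::size_t pos = 0; pos < pat.size(); ++pos)
        if (pat[pos] != '?' && pat[pos] != target[pos])
            return false;
    return true;
}

// Usage sketch.
void match_whole_example() {
    assert(match_whole("?bc", "abc"));
    assert(match_whole("?e?", "def"));
    assert(!match_whole("ab?", "xabc"));  // lengths differ
    assert(!match_whole("a?d", "abc"));   // last character differs
}
```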
Or, if you insist on "wildcards" in the form you exhibit, the term you want to search for is "globs" (at least on Unix-like systems).
The c-centric API is to be found in glob.h on unix-like systems, and consists of two calls glob and globfree in section 3 of the manual.
Switching to full regular expressions will allow you to use a more c++ approach as shown in the other answers.

boost regex sub-string match

I want to return output "match" if the pattern "regular" is a sub-string of variable st. Is this possible?
int main()
{
string st = "some regular expressions are Regxyzr";
boost::regex ex("[Rr]egular");
if (boost::regex_match(st, ex))
{
cout << "match" << endl;
}
else
{
cout << "not match" << endl;
}
}
boost::regex_match only matches the whole string; you probably want boost::regex_search instead.
regex_search does what you want; regex_match is documented as
determines whether a given regular expression matches all of a given character sequence
(the emphasis is in the original URL I'm quoting from).
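The difference can be sketched with std::regex, used here only as a stand-in so the example compiles without Boost; the assumption is that boost::regex_match and boost::regex_search treat this simple pattern the same way. The function names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if the pattern occurs anywhere in st (substring match).
bool contains_regular(const std::string& st) {
    static const std::regex ex("[Rr]egular");
    return std::regex_search(st, ex);
}

// True only if the pattern matches st from the first to the last character.
bool is_exactly_regular(const std::string& st) {
    static const std::regex ex("[Rr]egular");
    return std::regex_match(st, ex);
}

// Usage sketch with the string from the question.
void search_vs_match_example() {
    const std::string st = "some regular expressions are Regxyzr";
    assert(contains_regular(st));     // "regular" is a substring
    assert(!is_exactly_regular(st));  // the whole string is not just "regular"
    assert(is_exactly_regular("Regular"));
}
```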
Your question is answered with an example in the library documentation for boost::regex.
Alternate approach:
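You can use boost::regex_iterator, which is useful for parsing files and similar input.
In the example below, str is the std::string being searched, str.begin() and str.end() are its start and end iterators, and regExpression is a boost::regex.
Ex:
boost::sregex_iterator stIter(str.begin(), str.end(), regExpression);
boost::sregex_iterator endIter; // a default-constructed iterator marks the end
for (; stIter != endIter; ++stIter)
{
    cout << " Whole string " << (*stIter)[0] << endl;    // [0] is the entire match
    cout << " First sub-group " << (*stIter)[1] << endl; // [1] is the first capture group
}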