Get every regex match one by one with their positions - c++

I need to get all regex matches and their positions.
For example, I have this regex:
std::regex r("(a)|(b)|(c)");
And this input text:
std::string text("abcab");
Now I want to loop over the matches so that each iteration gives me all occurrences of one alternative. So in the first iteration I would get "a" at position 0 and "a" at position 3. In the second it would be "b" at 1 and "b" at 4. And in the third it would be "c" at position 2. How can I do this?
Currently I have every regex part separately (a regex for (a), (b) and (c)) and go through them one by one. But there are quite a lot of them, so I'm looking for a better/faster solution.

You can declare string vectors to store the captured values in, and then check which alternative branch matched, and add it to the corresponding vector.
Here is a C++ demo:
#include <string>
#include <iostream>
#include <regex>
#include <vector>
using namespace std;
int main() {
std::regex r("(a)|(b)|(c)");
std::string s = "abcab";
std::vector<std::string> astrings; // Declare the vectors to
std::vector<std::string> bstrings; // populate with the contents
std::vector<std::string> cstrings; // of capturing groups
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
if (m[1].matched) { // Check if Group 1 matched and
astrings.push_back(m[1].str()); // Put a value into a string vector
}
else if (m[2].matched) { // Check if Group 2 matched and
bstrings.push_back(m[2].str()); // Put a value into b string vector
}
else if (m[3].matched) { // Check if Group 3 matched and
cstrings.push_back(m[3].str()); // Put a value into c string vector
}
}
// Printing vectors - DEMO
for (auto i: astrings)
std::cout << i << ' ';
std::cout << "\n";
for (auto i: bstrings)
std::cout << i << ' ';
std::cout << "\n";
for (auto i: cstrings)
std::cout << i << ' ';
return 0;
}
You may also consider using std::regex_constants::optimize flag when declaring the regexp (see Galik's comment).
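The question also asks for the positions, which the demo above does not record. A minimal sketch of one way to keep them, assuming the pattern is a plain alternation of capture groups; the helper name collect_by_group and the Hit alias are made up for illustration:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// One (position, text) entry per occurrence.
using Hit = std::pair<std::size_t, std::string>;

// Hypothetical helper: result[k] holds every occurrence of capture group k + 1.
// Assumes the pattern is an alternation of capture groups, e.g. "(a)|(b)|(c)".
std::vector<std::vector<Hit>> collect_by_group(const std::string& text, const std::regex& re) {
    std::vector<std::vector<Hit>> result(re.mark_count());
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        for (std::size_t g = 1; g < m.size(); ++g) {
            if (m[g].matched) {
                // position(g) is the offset of group g from the start of text
                result[g - 1].emplace_back(static_cast<std::size_t>(m.position(g)), m[g].str());
            }
        }
    }
    return result;
}
```

The text is scanned only once, and looping over the returned vector afterwards gives one iteration per alternative: for "abcab" the first entry holds "a" at 0 and 3, the second "b" at 1 and 4, the third "c" at 2.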

Related

Parsing doubles and words in a string

I was working on the following exercise from The C++ Programming Language:
Read a sequence of possibly whitespace-separated (name,value) pairs,
where the name is a single whitespace-separated word and the value is
an integer or a floating-point value. Compute and print the sum and
mean for each name and the sum and mean for all names.
For example, given:
hello world5.678popcorn 8.123 rock 123 hello world 8.761 popcorn 98 rock 1.9rock2.3
The output of my implementation is:
rock: Sum (127.2), Mean (42.4)
hello world: Sum (14.439), Mean (7.2195)
popcorn: Sum (106.123), Mean (53.0615)
My implementation:
#include <iostream>
#include <string>
#include <unordered_map>
std::unordered_map<std::string, double> pairs;
std::unordered_map<std::string, int> occurences;
void save(const std::string& name, const std::string& value);
void trim(std::string& s, const std::string& chars = " ");
int main() {
std::string line;
getline(std::cin, line);
std::string name;
std::string value;
bool name_saved = false; // must start as false; reading an uninitialized bool is undefined behavior
for(char c : line) {
if(!name_saved && isdigit(c)) { // reached end of name
name_saved = true;
trim(name);
value += c;
} else if(!name_saved) { // add char to name
name += c;
} else if(name_saved) {
if(isdigit(c) || (c == '.' && (value.find_first_of(".") == std::string::npos))) { // add char to value
value += c;
} else { // reached end of value
trim(value);
save(name, value);
name = "";
value = "";
name_saved = false;
if(isalpha(c)) {
name += c;
}
}
}
}
if(value != "") {
save(name, value);
}
std::cout << "\n";
for(auto pair : pairs) {
std::cout << pair.first << ": " << "Sum (" << pair.second << "), Mean (" << (pair.second / occurences[pair.first]) << ")\n";
}
return 0;
}
void save(const std::string& name, const std::string& value) {
pairs[name] += std::stod(value);
occurences[name]++;
}
void trim(std::string& s, const std::string& chars) {
s.erase(0, s.find_first_not_of(chars));
s.erase(s.find_last_not_of(chars) + 1);
}
I was wondering what a more efficient approach to this exercise would be. I feel that my code is quite messy and I would like to get some input on what I could use to clean it up and make it more compact.
There are many different solutions. It depends a little on your personal programming style and on what you have already learned.
The above example cries out for "regular expressions". You can read about them in the C++ reference. The function std::regex_search in particular will be your friend.
First, the regular expression. You are looking for a "text" with embedded spaces ("single whitespace-separated word"), followed by an int or float. This can be expressed easily with a "regex":
([a-zA-Z]+ ?[a-zA-Z]+) ?(\d+\.?\d*)
So, first we have 1 or more alpha-characters, then an optional space and then again 1 or more alpha-characters. This makes up the name.
For the value, we have 1 or more digits, maybe followed by a "." and maybe more digits.
You get a better understanding if you paste the regex and the test string into an online regex tester.
It will give you a more detailed explanation, especially of the parentheses "()", which form groups.
And the groups can be extracted after std::regex_search has found a match. Meaning: if std::regex_search finds a match for the given regex, it returns true and the resulting groups can be found in the std::smatch object.
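As a small sketch of how the groups come out of a single successful std::regex_search (the helper name first_pair is made up for illustration, and returning two empty strings on failure is just one possible convention):

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns the (name, value) texts of the first match,
// or two empty strings when nothing matches.
std::pair<std::string, std::string> first_pair(const std::string& text) {
    static const std::regex re{ R"(([a-zA-Z]+ ?[a-zA-Z]+) ?(\d+\.?\d*))" };
    std::smatch sm;
    if (std::regex_search(text, sm, re)) {
        // sm[0] is the whole match; sm[1] and sm[2] are the two groups
        return { sm[1].str(), sm[2].str() };
    }
    return {};
}
```

Called with "hello world5.678popcorn 8.123", this should yield "hello world" and "5.678", because only the first match is reported.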
And with all this, we can define a simple for loop to get all names and values from the test string.
for (std::string s{ test }; std::regex_search(s, sm, re); s = sm.suffix())
First we will initialize the "loop-run" variable, in this case a std::string and initialize it with the given test-string. Then we will search for a match in the test-string. If there was a match, then we will find the result in sm[1] and sm[2]. After we did all operations inside the loop body, we set the "loop-run" variable to the not yet matched rest of the test string. The suffix.
To group, calculate and aggregate the values, we use a std::map. The key is the "name" and the value is a std::pair consisting of the count of names and the sum of the associated values. So, in the loop:
for (std::string s{ test }; std::regex_search(s, sm, re); s = sm.suffix()) {
// Count the occurrences of a text
aggregator[sm[1]].first++;
// Sum up the values for a text
aggregator[sm[1]].second += std::stod(sm[2]);
}
we use aggregator[sm[1]] to either create the entry for the name in the map or retrieve the existing one. In both cases we then have a reference to the entry for the current name, and we can increment the count and build the running sum.
This is a very simple 3 line approach, which does already nearly all the expected work.
The rest is simple calculation of expected sum and mean values and showing everything on the screen.
Please see the full code below:
#include <iostream>
#include <string>
#include <regex>
#include <vector>
#include <iterator>
#include <map>
// The regex for words with embedded space and floats/ints
const std::regex re{ R"(([a-zA-Z]+ ?[a-zA-Z]+) ?(\d+\.?\d*))" };
int main() {
// Definition Section --------------------------------------------------------------------------------
// The input test string
std::string test{"hello world5.678popcorn 8.123 rock 123 hello world 8.761 popcorn 98 rock 1.9rock2.3"};
// Here we will store the result. The text and the associated "count" and "sum"
std::map<std::string, std::pair<unsigned int, double>> aggregator{};
std::smatch sm;
// Find, store and calculate data -------------------------------------------------------------------
// Iterate through the string and get the text and the float value
for (std::string s{ test }; std::regex_search(s, sm, re); s = sm.suffix()) {
// Count the occurrences of a text
aggregator[sm[1]].first++;
// Sum up the values for a text
aggregator[sm[1]].second += std::stod(sm[2]);
}
// Output data ----------------------------------------------------------------------------------------
// Since the task is to calculate also the overall results, we will do
unsigned int countOverall{};
double sumOverall{};
// Iterate over the "text" data and output sum and mean value per text and aggregate the overall values
for (const auto& [text, agg] : aggregator) {
// Output sum and mean per text
std::cout << "\n" << text << ": Sum (" << agg.second << "), Mean (" << agg.second / agg.first << ")";
// Aggregate overall values
countOverall += agg.first;
sumOverall += agg.second;
}
// Show overall result to the user.
std::cout << "\n\nSum overall: (" << sumOverall << "), Mean overall: (" << sumOverall / countOverall << ")\n\n";
return 0;
}
Whether this solution is "better" or not is for you to decide.
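The loop above copies the remaining suffix into a new string on every step. A possible variant, sketched here with std::sregex_iterator and the same regex, walks the original string in place; the function name aggregate and the Aggregator alias are illustrative assumptions, not part of the code above:

```cpp
#include <map>
#include <regex>
#include <string>
#include <utility>

// name -> (count, sum), the same layout as the aggregator in the full code
using Aggregator = std::map<std::string, std::pair<unsigned int, double>>;

// Hypothetical helper: visits every match without re-assigning the suffix.
Aggregator aggregate(const std::string& text) {
    static const std::regex re{ R"(([a-zA-Z]+ ?[a-zA-Z]+) ?(\d+\.?\d*))" };
    Aggregator result;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        auto& entry = result[(*it)[1].str()];      // create or look up the name
        ++entry.first;                             // count
        entry.second += std::stod((*it)[2].str()); // running sum
    }
    return result;
}
```

For the test string from the question this should give the same three names, with "rock" counted three times and "hello world" and "popcorn" twice each.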

Regex - count all numbers

I'm looking for a regex pattern that returns true if a given string contains 7 digits. The digits can be anywhere in the string, so a string such as "100 my, str1ng y000" should match.
A regex that simply looks for 7 digits does not check for an exact count: it also returns true when the string contains more than 7 digits, because it only needs to find at least 7 of them.
You can use the code below to test for an exact number of digits (7 in your case) in any string:
var temp = "100 my, str1ng y000 3c43fdgd";
var count = (temp.match(/\d/g) || []).length;
alert(count == 7);
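A rough C++ counterpart of the JavaScript snippet above, as a sketch: it counts the single-digit matches by measuring the distance between a std::sregex_iterator and the end iterator. The helper name count_digits is made up for illustration.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: number of single-digit matches in text.
std::size_t count_digits(const std::string& text) {
    static const std::regex digit{ R"(\d)" };
    const auto n = std::distance(
        std::sregex_iterator(text.begin(), text.end(), digit),
        std::sregex_iterator());
    return static_cast<std::size_t>(n);
}
```

The exact-count check is then count_digits(temp) == 7, which is true for "100 my, str1ng y000" and false for the longer string used in the JavaScript example.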
I will show you a C++ example that:
Shows a regex for extracting digit groups
Shows a regex for matching at least 7 digits
Shows, if there is a match for the requested predicate
Shows the number of digits in the string (no regex needed)
Shows the group of digits
#include <iostream>
#include <string>
#include <algorithm>
#include <cctype> // std::isdigit
#include <iterator> // std::ostream_iterator
#include <vector>
#include <regex>
// Our test data
std::string testData("100 my, str1ng y000");
std::regex re1(R"#((\d+))#"); // For extracting digit groups
std::regex re2(R"#((\d.*){7,})#"); // For regex match
int main(void)
{
// Define the variable id as vector of string and use the range constructor to read the test data and tokenize it
std::vector<std::string> id{ std::sregex_token_iterator(testData.begin(), testData.end(), re1, 1), std::sregex_token_iterator() };
// Match the regex. Should have at least 7 digits somewhere
std::smatch base_match;
bool containsAtLeast7Digits = std::regex_match(testData, base_match, re2);
// Show result on screen
std::cout << "\nEvaluating string '" << testData <<
"'\n\nThe predicate 'contains-at-least-7-digits' is " << std::boolalpha << containsAtLeast7Digits <<
"\n\nIt contains overall " <<
std::count_if(
testData.begin(),
testData.end(),
[](const char c) {
return std::isdigit(static_cast<unsigned char>(c)) != 0; // cast avoids undefined behavior for negative char values
}
) << " digits and " << id.size() << " digit groups. These are:\n\n";
// Print complete vector to std::cout
std::copy(id.begin(), id.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
return 0;
}
Please note: use std::count_if for counting the digits, as shown above. It is faster and easier than a regex.
Hope this helps . . .

c++11 regexp retrieving all groups with +/* modifiers

I don't understand how to retrieve all groups using regexp in c++
An example:
const std::string s = "1,2,3,5";
std::regex lrx("^(\\d+)(,(\\d+))*$");
std::smatch match;
if (std::regex_search(s, match, lrx))
{
int i = 0;
for (auto m : match)
std::cout << " submatch " << i++ << ": "<< m << std::endl;
}
Gives me the result
submatch 0: 1,2,3,5
submatch 1: 1
submatch 2: ,5
submatch 3: 5
I am missing 2 and 3
You cannot use the current approach, because std::regex does not keep every value captured by a repeated group. Each time the group captures a new part of the string, the previous value is overwritten, so only the last captured value is available once the match is found and returned. And since you defined 3 capturing groups in the pattern, you have 3+1 groups in the output.
Mind also, that std::regex_search only returns one match, while you will need multiple matches here.
So, what you may do is to perform 2 steps: 1) validate the string using the pattern you have (no capturing is necessary here), 2) extract the digits (or split with a comma, that depends on the requirements).
A C++ demo:
#include <string>
#include <iostream>
#include <regex>
using namespace std;
int main() {
std::regex rx_extract("[0-9]+");
std::regex rx_validate(R"(^\d+(?:,\d+)*$)");
std::string s = "1,2,3,5";
if (regex_match(s, rx_validate)) {
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), rx_extract);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
std::cout << m.str() << '\n';
}
}
return 0;
}
Output:
1
2
3
5
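The other option mentioned in step 2, splitting on the comma, could look roughly like this. It is a sketch using std::sregex_token_iterator with submatch index -1; the helper name split_numbers and the choice to return an empty vector for invalid input are assumptions for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: validates the list, then splits it on commas.
// Returns an empty vector when the input is not a comma-separated list of digits.
std::vector<std::string> split_numbers(const std::string& s) {
    static const std::regex rx_validate{ R"(^\d+(?:,\d+)*$)" };
    static const std::regex rx_comma{ "," };
    if (!std::regex_match(s, rx_validate)) {
        return {};
    }
    // Index -1 selects the text between the matches instead of the matches themselves.
    return std::vector<std::string>(
        std::sregex_token_iterator(s.begin(), s.end(), rx_comma, -1),
        std::sregex_token_iterator());
}
```

For "1,2,3,5" this should return the four strings 1, 2, 3 and 5, the same values as the extraction demo prints.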

Is it possible to find two strings in one string using regular expressions? [duplicate]

I'm a bit confused about the following C++11 code:
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::string haystack("abcdefabcghiabc");
std::regex needle("abc");
std::smatch matches;
std::regex_search(haystack, matches, needle);
std::cout << matches.size() << std::endl;
}
I'd expect it to print out 3 but instead I get 1. Am I missing something?
You get 1 because regex_search returns only 1 match, and size() will return the number of capture groups + the whole match value.
Your matches is...:
Object of a match_results type (such as cmatch or smatch) that is filled by this function with information about the match results and any submatches found.
If [the regex search is] successful, it is not empty and contains a series of sub_match objects: the first sub_match element corresponds to the entire match, and, if the regex expression contained sub-expressions to be matched (i.e., parentheses-delimited groups), their corresponding sub-matches are stored as successive sub_match elements in the match_results object.
Here is code that will find multiple matches:
#include <string>
#include <iostream>
#include <regex>
using namespace std;
int main() {
string str("abcdefabcghiabc");
int i = 0;
regex rgx1("abc");
smatch smtch;
while (regex_search(str, smtch, rgx1)) {
std::cout << i << ": " << smtch[0] << std::endl;
i += 1;
str = smtch.suffix().str();
}
return 0;
}
See IDEONE demo returning abc 3 times.
As this method destroys the input string, here is another alternative based on the std::sregex_iterator (std::wsregex_iterator should be used when your subject is an std::wstring object):
#include <string>
#include <iostream>
#include <regex>
int main() {
std::regex r("ab(c)");
std::string s = "abcdefabcghiabc";
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
std::cout << "Match value: " << m.str() << " at Position " << m.position() << '\n';
std::cout << " Capture: " << m[1].str() << " at Position " << m.position(1) << '\n';
}
return 0;
}
See IDEONE demo, returning
Match value: abc at Position 0
Capture: c at Position 2
Match value: abc at Position 6
Capture: c at Position 8
Match value: abc at Position 12
Capture: c at Position 14
What you're missing is that matches is populated with one entry for each capture group (including the entire matched substring as the 0th capture).
If you write
std::regex needle("a(b)c");
then you'll get matches.size()==2, with matches[0]=="abc", and matches[1]=="b".
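To make the difference between size() and the number of occurrences concrete, here is a small sketch with two hypothetical helpers (results_size and count_matches are names invented for this example):

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: size of the match_results after one regex_search,
// i.e. 1 + number of capture groups on success, 0 when nothing matched.
std::size_t results_size(const std::string& haystack, const std::string& pattern) {
    const std::regex needle(pattern);
    std::smatch matches;
    std::regex_search(haystack, matches, needle);
    return matches.size();
}

// Hypothetical helper: number of non-overlapping occurrences of the pattern.
std::size_t count_matches(const std::string& haystack, const std::string& pattern) {
    const std::regex needle(pattern);
    return static_cast<std::size_t>(std::distance(
        std::sregex_iterator(haystack.begin(), haystack.end(), needle),
        std::sregex_iterator()));
}
```

With the haystack from the question, results_size gives 1 for "abc" and 2 for "a(b)c", while count_matches gives the 3 the question expected.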
EDIT: Some people have downvoted this answer. That may be for a variety of reasons, but if it is because it does not apply to the answer I criticized (no one left a comment to explain the decision), they should take note that W. Stribizew changed the code two months after I wrote this, and I was unaware of it until today, 2021-01-18. The rest of the answer is unchanged from when I first wrote it.
@stribizhev's solution has quadratic worst case complexity for sane regular expressions. For insane ones (e.g. "y*"), it doesn't terminate. In some applications, these issues could be DoS attacks waiting to happen. Here's a fixed version:
string str("abcdefabcghiabc");
int i = 0;
regex rgx1("abc");
smatch smtch;
auto beg = str.cbegin();
while (regex_search(beg, str.cend(), smtch, rgx1)) {
std::cout << i << ": " << smtch[0] << std::endl;
i += 1;
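beg = smtch[0].second; // resume right after the end of this match, not just length(0) characters past beg
if ( smtch.length(0) == 0 ) { // empty match: step one character forward so the loop makes progress
if ( beg == str.cend() )
break;
++beg;
}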
}
According to my personal preference, this will find n+1 matches of an empty regex in a string of length n. You might also just exit the loop after an empty match.
If you want to compare the performance for a string with millions of matches, add the following lines after the definition of str (and don't forget to turn on optimizations), once for each version:
for (int j = 0; j < 20; ++j)
str = str + str;

c++11 regex : check if a set of characters exist in a string

If for example, I have the string: "asdf{ asdf }",
I want to check if the string contains any character in the set []{}().
How would I go about doing this?
I'm looking for a general solution that checks if the string has the characters in the set, so that I can continue to add lookup characters in the set in the future.
Your question is unclear on whether you only want to detect if any of the characters in the search set are present in the input string, or whether you want to find all matches.
In either case, use std::regex to create the regular expression object. Because all the characters in your search set have special meanings in regular expressions, you'll need to escape all of them.
std::regex r{R"([\[\]\{\}\(\)])"};
char const *str = "asdf{ asdf }";
If you want to only detect whether at least one match was found, use std::regex_search.
std::cmatch results;
if(std::regex_search(str, results, r)) {
std::cout << "match found\n";
}
On the other hand, if you want to find all the matches, use std::regex_iterator.
std::cmatch results;
auto first = std::cregex_iterator(str, str + std::strlen(str), r);
auto last = std::cregex_iterator();
if(first != last) std::cout << "match found\n";
while(first != last) {
std::cout << (*first++).str() << '\n';
}
I know you are asking about regex, but this specific problem can be solved without it using std::string::find_first_of(), which finds the position of the first character in the string (s) that is contained in a set (g):
#include <string>
#include <iostream>
int main()
{
std::string s = "asdf{ asdf }";
std::string g = "[]{}()";
// Does the string contain one of the characters?
if(s.find_first_of(g) != std::string::npos)
std::cout << s << " contains one of " << g << '\n';
// find the position of each occurrence of the characters in the string
for(size_t pos = 0; (pos = s.find_first_of(g, pos)) != std::string::npos; ++pos)
std::cout << s << " contains " << s[pos] << " at " << pos << '\n';
}
OUTPUT:
asdf{ asdf } contains one of []{}()
asdf{ asdf } contains { at 4
asdf{ asdf } contains } at 11
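For the regex route, the question also asks for a set that can grow over time. One possible sketch builds the bracket expression from a plain list of characters instead of hand-writing the escapes. The helper names make_char_class and contains_any are invented here, and the escaping rule (backslash before everything except letters, digits and underscore) is an assumption about what the default ECMAScript grammar of std::regex accepts, not something taken from the answers above.

```cpp
#include <cctype>
#include <regex>
#include <string>

// Hypothetical helper: turns a plain list of characters into a bracket expression.
// Letters, digits and '_' are left unescaped because an escaped letter such as
// \d or \w would change its meaning; everything else gets a backslash (assumption).
std::string make_char_class(const std::string& chars) {
    std::string pattern = "[";
    for (unsigned char c : chars) {
        if (!std::isalnum(c) && c != '_') {
            pattern += '\\';
        }
        pattern += static_cast<char>(c);
    }
    pattern += ']';
    return pattern;
}

// True when text contains at least one character from chars.
bool contains_any(const std::string& text, const std::string& chars) {
    if (chars.empty()) {
        return false; // "[]" would not be a valid pattern
    }
    const std::regex re(make_char_class(chars));
    return std::regex_search(text, re);
}
```

Adding a lookup character later then only means extending the string passed as chars, for example contains_any(s, "[]{}()<>").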