C++11 regex matching capturing group multiple times - c++

Could someone please help me to extract the text between the : and the ^ symbols using a JavaScript (ECMAScript) regular expression in C++11. I do not need to capture the hw-descriptor itself - but it does have to be present in the line in order for the rest of the line to be considered for a match. Also the :p....^, :m....^ and :u....^ can arrive in any order and there has to be at least 1 present.
I tried using the following regular expression:
static const std::regex gRegex("(?:hw-descriptor)(:[pmu](.*?)\\^)+", std::regex::icase);
against the following text line:
"hw-descriptor:pTEXT1^:mTEXT2^:uTEXT3^"
Here is the code, which I posted on a live Coliru. It shows how I attempted to solve this problem; however, I am only getting 1 match. I need to see how to extract each of the potential 3 matches corresponding to the p, m or u characters described earlier.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
#include <regex>
int main()
{
static const std::regex gRegex("(?:hw-descriptor)(:[pmu](.*?)\\^)+", std::regex::icase);
std::string foo = "hw-descriptor:pTEXT1^:mTEXT2^:uTEXT3^";
// I seem to only get 1 match here, I was expecting
// to loop through each of the matches, looks like I need something like
// a pcre global option but I don't know how.
std::for_each(std::sregex_iterator(foo.cbegin(), foo.cend(), gRegex), std::sregex_iterator(),
[&](const auto& rMatch) {
for (int i=0; i< static_cast<int>(rMatch.size()); ++i) {
std::cout << rMatch[i] << std::endl;
}
});
}
The above program gives the following output:
g++ -std=c++14 -O2 -Wall -pedantic -pthread main.cpp && ./a.out
hw-descriptor:pTEXT1^:mTEXT2^:uTEXT3^
:uTEXT3^
TEXT3

With std::regex, you cannot keep multiple repeated captures when matching a string with consecutive repeated patterns: a repeated capturing group only retains its last iteration.
What you may do is to match the overall texts containing the prefix and the repeated chunks, capture the latter into a separate group, and then use a second smaller regex to grab all the occurrences of the substrings you want separately.
The first regex here may be
hw-descriptor((?::[pmu][^^]*\\^)+)
See the online demo. It will match hw-descriptor and ((?::[pmu][^^]*\\^)+) will capture into Group 1 one or more repetitions of :[pmu][^^]*\^ pattern: :, p/m/u, 0 or more chars other than ^ and then ^. Upon finding a match, use :[pmu][^^]*\^ regex to return all the real "matches".
C++ demo:
static const std::regex gRegex("hw-descriptor((?::[pmu][^^]*\\^)+)", std::regex::icase);
static const std::regex lRegex(":[pmu][^^]*\\^", std::regex::icase);
std::string foo = "hw-descriptor:pTEXT1^:mTEXT2^:uTEXT3^ hw-descriptor:pTEXT8^:mTEXT8^:uTEXT83^";
std::smatch smtch;
for(std::sregex_iterator i = std::sregex_iterator(foo.begin(), foo.end(), gRegex);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
std::cout << "Match value: " << m.str() << std::endl;
std::string x = m.str(1);
for(std::sregex_iterator j = std::sregex_iterator(x.begin(), x.end(), lRegex);
j != std::sregex_iterator();
++j)
{
std::cout << "Element value: " << (*j).str() << std::endl;
}
}
Output:
Match value: hw-descriptor:pTEXT1^:mTEXT2^:uTEXT3^
Element value: :pTEXT1^
Element value: :mTEXT2^
Element value: :uTEXT3^
Match value: hw-descriptor:pTEXT8^:mTEXT8^:uTEXT83^
Element value: :pTEXT8^
Element value: :mTEXT8^
Element value: :uTEXT83^

Related

c++11 (MSVS2012) regex looking for file names in multiple line std::string

I have been trying to search for a clear answer on this one, but not been able to find it.
So lets say I have the string (where \n could be \r\n - I want to handle both - not sure if that is relevant or not)
"4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54"
Then I want to get matches:
a_file_123.xml
a_file_j34.xml
Here is my test code:
const std::string s = "4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54";
std::smatch matches;
if (std::regex_search(s, matches, std::regex("a_file_(.*)\\.xml")))
{
std::cout << "total: " << matches.size() << std::endl;
for (unsigned int i = 0; i < matches.size(); i++)
{
std::cout << "match: " << matches[i] << std::endl;
}
}
Output is:
total: 2
match: a_file_123.xml
match: 123
I don't quite understand why match 2 is just "123"...
You only have one match, not two, as the regex_search method returns a single match. What you printed is two group values, Group 0 (the whole match, a_file_123.xml here) and Group 1 (the capturing group value, here, 123 that is a substring captured with a capturing group you defined as (.*) in the pattern).
If you want to match multiple strings, you need to use the regex iterator, not just a regex_search that only returns the first match.
Besides, .* is too greedy and will return unexpected results if there is more than 1 match on the same line, because it runs to the last .xml on that line. It seems you want to match letters or digits, so .* can be replaced with \w+. If there can really be anything, use the lazy .*? instead.
Use
const std::string s = "4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54";
const std::regex rx("a_file_\\w+\\.xml");
std::vector<std::string> results(std::sregex_token_iterator(s.begin(), s.end(), rx),
std::sregex_token_iterator());
std::cout << "Number of matches: " << results.size() << std::endl;
for (auto result : results)
{
std::cout << result << std::endl;
}
See the C++ demo yielding
Number of matches: 2
a_file_123.xml
a_file_j34.xml
Notes on regex
a_file_ - a literal substring
\\w+ - 1+ word chars (letters, digits, _) (note you may use [^.]*? here instead of \\w+ if you want to match any char, 0 or more repetitions, as few as possible, up to the first .xml)
\\. - a dot (if you do not escape it, it will match any char except line break chars)
xml - a literal substring.
See the regex demo

C++ regex finds only 1 sub match [duplicate]

// Example program
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::string strr("1.0.0.0029.443");
std::regex rgx("([0-9])");
std::smatch match;
if (std::regex_search(strr, match, rgx)) {
for(int i=0;i<match.size();i++)
std::cout << match[i] << std::endl;
}
}
this program should write
1
0
0
0
0
2
9
4
4
3
but it writes
1
1
checked it here http://cpp.sh/ and on visual studio, both same results.
Why does it find only 2 matches and why are they same?
As I understand from the answers here, regex_search stops at the first match, and the match variable holds the (sub?)string value needed to continue (by repeating) for other matches. Also, since it stops at the first match, the () characters are used only for sub-matches within the result.
Being called once, regex_search returns only the first match in the match variable. The collection in match comprises the match itself and capture groups if there are any.
In order to get all matches call regex_search in a loop:
while(regex_search(strr, match, rgx))
{
std::cout << match[0] << std::endl;
strr = match.suffix();
}
Note that in your case the first capture group is the same as the whole match, so there is no need for the group and you may define the regex simply as [0-9] (without parentheses).
Demo: https://ideone.com/pQ6IsO
Problems:
Using if only gives you one match. You need to use a while loop to find all the matches. You need to search past the previous match in the next iteration of the loop.
std::smatch::size() returns 1 + the number of capture groups (sub-matches) in the regex, not the number of matches found in the string. See its documentation. Element 0 is the entire match, so use match[0] to get its text.
Here's an updated version of your program:
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::string strr("1.0.0.0029.443");
std::regex rgx("([0-9])");
std::smatch match;
while (std::regex_search(strr, match, rgx)) {
std::cout << match[0] << std::endl;
strr = match.suffix();
}
}

C++ Regex: non-greedy match

I'm currently trying to make a regex which matches URL parameters and extracts them.
For example, if I got the following parameters string ?param1=someValue&param2=someOtherValue, std::regex_match should extract the following contents:
param1
some_content
param2
some_other_content
After trying different regex patterns, I finally built one corresponding to what I want: std::regex("(?:[\\?&]([^=&]+)=([^=&]+))*").
If I take the previous example, std::regex_match matches as expected. However, it does not extract the expected values, keeping only the last captured values.
For example, the following code:
std::regex paramsRegex("(?:[\\?&]([^=&]+)=([^=&]+))*");
std::string arg = "?param1=someValue&param2=someOtherValue";
std::smatch sm;
std::regex_match(arg, sm, paramsRegex);
for (const auto &match : sm)
std::cout << match << std::endl;
will give the following output:
param2
someOtherValue
As you can see, param1 and its value are skipped and not captured.
After searching on google, I've found that this is due to greedy capture and I have modified my regex into "(?:[\\?&]([^=&]+)=([^=&]+))\\*?" in order to enable non-greedy capturing.
This regex works well when I try it on rubular but it does not match when I use it in C++ (std::regex_match returns false and nothing is captured).
I've tried different std::regex_constants options (different regex grammar by using std::regex_constants::grep, std::regex_constants::egrep, ...) but the result is the same.
Does someone know how to do non-greedy regex capture in C++?
As Casimir et Hippolyte explained in his comment, I just need to:
remove the quantifier
Use std::regex_iterator
It gives me the following code:
std::regex paramsRegex("[\\?&]([^=]+)=([^&]+)");
std::string url_params = "?key1=val1&key2=val2&key3=val3&key4=val4";
std::smatch sm;
auto params_it = std::sregex_iterator(url_params.cbegin(), url_params.cend(), paramsRegex);
auto params_end = std::sregex_iterator();
while (params_it != params_end) {
auto param = params_it->str();
std::regex_match(param, sm, paramsRegex);
for (const auto &s : sm)
std::cout << s << std::endl;
++params_it;
}
And here is the output:
?key1=val1
key1
val1
&key2=val2
key2
val2
&key3=val3
key3
val3
&key4=val4
key4
val4
The original regex (?:[\\?&]([^=&]+)=([^=&]+))* was just changed into [\\?&]([^=]+)=([^&]+).
Then, by using std::sregex_iterator, I get an iterator over each match (?key1=val1, &key2=val2, ...).
Finally, by calling std::regex_match on each sub-string, I can retrieve parameters values.
Try to use match_results::prefix/suffix:
string match_expression("your expression");
smatch result;
regex fnd(match_expression, regex_constants::icase);
while (regex_search(in_str, result, fnd, std::regex_constants::match_any))
{
for (size_t i = 1; i < result.size(); i++)
{
std::cout << result[i].str();
}
in_str = result.suffix();
}

c++ regex substring wrong pattern found

I'm trying to understand the logic on the regex in c++
std::string s ("Ni Ni Ni NI");
std::regex e ("(Ni)");
std::smatch sm;
std::regex_search (s,sm,e);
std::cout << "string object with " << sm.size() << " matches\n";
Shouldn't this give me the number of substrings matching my pattern? It always gives me 1 match and says that the match is [Ni, Ni], but I need it to find every occurrence of the pattern; there should be 3, like this: [Ni][Ni][Ni]
The function std::regex_search only returns the results for the first match found in your string.
Here is a code, merged from yours and from cplusplus.com. The idea is to search for the first match, analyze it, and then start again using the rest of the string (that is to say, the sub-string that directly follows the match that was found, which can be retrieved thanks to match_results::suffix ).
Note that the regex has two capturing groups (Ni*) and ([^ ]*).
std::string s("the knights who say Niaaa and Niooo");
std::smatch m;
std::regex e("(Ni*)([^ ]*)");
while (std::regex_search(s, m, e))
{
for (auto x : m)
std::cout << x.str() << " ";
std::cout << std::endl;
s = m.suffix().str();
}
This gives the following output:
Niaaa Ni aaa
Niooo Ni ooo
As you can see, for every call to regex_search, we have the following information:
the content of the whole match,
the content of every capturing group.
Since we have two capturing groups, this gives us 3 strings for every regex_search.
EDIT: in your case if you want to retrieve every "Ni", all you need to do is to replace
std::regex e("(Ni*)([^ ]*)");
with
std::regex e("(Ni)");
You still need to iterate over your string, though.

Get String Between 2 Strings

How can I get a string that is between two other declared strings, for example:
String 1 = "[STRING1]"
String 2 = "[STRING2]"
Source:
"832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
How can I get the "I need this text here"?
Since this is homework, only clues:
Find index1 of occurrence of String1
Find index2 of occurrence of String2
Substring from index1+lengthOf(String1) (inclusive) to index2 (exclusive) is what you need
Copy this to a result buffer if necessary (don't forget to null-terminate)
Might be a good case for std::regex, which is part of C++11.
#include <iostream>
#include <string>
#include <regex>
int main()
{
using namespace std::string_literals;
auto start = "\\[STRING1\\]"s;
auto end = "\\[STRING2\\]"s;
std::regex base_regex(start + "(.*)" + end);
auto example = "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"s;
std::smatch base_match;
std::string matched;
if (std::regex_search(example, base_match, base_regex)) {
// The first sub_match is the whole string; the next
// sub_match is the first parenthesized expression.
if (base_match.size() == 2) {
matched = base_match[1].str();
}
}
std::cout << "example: \""<<example << "\"\n";
std::cout << "matched: \""<<matched << "\"\n";
}
Prints:
example: "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
matched: "I need this text here"
What I did was write a program that creates two strings, start and end, that serve as my start and end delimiters. I then build a regular expression that looks for those and captures anything in between (including nothing). Then I use regex_search to find the matching part of the input, and set matched to the captured text. Note that (.*) is greedy, so if [STRING2] occurs more than once, the capture extends to the last occurrence.
For more info, see http://en.cppreference.com/w/cpp/regex and http://en.cppreference.com/w/cpp/regex/regex_search
Use strstr (http://www.cplusplus.com/reference/clibrary/cstring/strstr/). Calling it once for each delimiter gives you two pointers; check that pointer1 < pointer2 and, if so, read the characters between them, starting after the end of the first delimiter.