I am new to C++ and would like to know how to extract multiple substrings from a single string when they all sit between the same delimiters.
For example:
"{("id":"4219","firstname":"Paul"),("id":"4349","firstname":"Joe"),("id":"4829","firstname":"Brandy")}"
I want the ids:
4219 , 4349 , 4829
You can use regex to match the ids:
#include <iostream>
#include <regex>
#include <string>
int main() {
// This is your string.
std::string s{ R"({("id":"4219","firstname":"Paul"),("id":"4349","firstname":"Joe"),("id":"4829","firstname":"Brandy")})"};
// Matches "id":"<any number of digits>"
// The id will be captured in the first group
std::regex r(R"("id"\s*:\s*"(\d+))");
// Make iterators that perform the matching
auto ids_begin = std::sregex_iterator(s.begin(), s.end(), r);
auto ids_end = std::sregex_iterator();
// Iterate the matches and print the first group of each of them
// (where the id is captured)
for (auto it = ids_begin; it != ids_end; ++it) {
std::smatch match = *it;
std::cout << match[1].str() << ',';
}
}
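If you need the ids as numbers rather than printed text, the same iterator loop can fill a vector. This is only a sketch: `extract_ids` is a name I made up, and it assumes every id fits in an `int` (otherwise `std::stoi` throws).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer above): returns every numeric
// "id" value found in `s`, in the order the ids appear.
std::vector<int> extract_ids(const std::string& s) {
    // Same pattern as the answer: "id":"<digits>", digits captured in group 1.
    static const std::regex r(R"("id"\s*:\s*"(\d+))");
    std::vector<int> ids;
    for (std::sregex_iterator it(s.begin(), s.end(), r), end; it != end; ++it) {
        // Assumption: each id fits in an int; std::stoi throws otherwise.
        ids.push_back(std::stoi((*it)[1].str()));
    }
    return ids;
}
```

Calling `extract_ids(s)` on the string from the question should give a vector holding 4219, 4349 and 4829.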
Well, here is the quick-and-dirty hack:
#include <iostream>
#include <sstream>
#include <string>
int main()
{
std::string s{ "{(\"id\":\"4219\",\"firstname\":\"Paul\"),"
"(\"id\":\"4349\",\"firstname\":\"Joe\"),"
"(\"id\":\"4829\",\"firstname\":\"Brandy\")}"
};
const std::string key{ "\"id\":\"" };
for (auto f = s.find(key); f != s.npos; f = s.find(key, f)) {
f += key.length(); // skip past the key to the first digit
std::istringstream iss{ s.substr(f) };
int id; iss >> id;
std::cout << id << '\n';
}
}
Reliable? Well, just hope nobody names children "id":" ...
How can I extract Test and Again from the string s in the code below?
Currently I am using regex_iterator, but it doesn't seem to return the capture groups of the regular expression: I get {{Test}} and {{Again}} in the output.
#include <regex>
#include <iostream>
int main()
{
const std::string s = "<abc>{{Test}}</abc><def>{{Again}}</def>";
std::regex rgx("\\{\\{(\\w+)\\}\\}");
std::smatch match;
std::sregex_iterator next(s.begin(), s.end(), rgx);
std::sregex_iterator end;
while (next != end) {
std::smatch match = *next;
std::cout << match.str() << "\n";
next++;
}
return 0;
}
I also tried using regex_search, but it does not find multiple matches and only gives the Test output:
#include <regex>
#include <iostream>
int main()
{
const std::string s = "<abc>{{Test}}</abc><def>{{Again}}</def>";
std::regex rgx("\\{\\{(\\w+)\\}\\}");
std::smatch match;
if (std::regex_search(s, match, rgx,std::regex_constants::match_any))
{
std::cout<<"Match size is "<<match.size()<<std::endl;
for(auto elem:match)
std::cout << "match: " << elem << '\n';
}
}
Also, as a side note, why are two backslashes needed to escape { or }?
To access the contents of the capturing group you need to use .str(1):
std::cout << match.str(1) << std::endl;
See the C++ demo:
#include <regex>
#include <iostream>
int main()
{
const std::string s = "<abc>{{Test}}</abc><def>{{Again}}</def>";
// std::regex rgx("\\{\\{(\\w+)\\}\\}");
// Better, use a raw string literal:
std::regex rgx(R"(\{\{(\w+)\}\})");
std::smatch match;
std::sregex_iterator next(s.begin(), s.end(), rgx);
std::sregex_iterator end;
while (next != end) {
std::smatch match = *next;
std::cout << match.str(1) << std::endl;
next++;
}
return 0;
}
Output:
Test
Again
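Note that you do not have to use double backslashes to define a regex escape sequence inside raw string literals (here, R"(pattern_here)"). In an ordinary string literal the compiler treats the backslash as its own escape character, so you have to write "\\{" to hand the two characters \{ to the regex engine.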
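To make the point about double backslashes concrete, here is a small sketch showing that the escaped literal and the raw literal spell exactly the same pattern. The names `escaped`, `raw` and `first_placeholder` are mine, purely for illustration.

```cpp
#include <regex>
#include <string>

// Both literals produce the same characters: \{\{(\w+)\}\}
const std::string escaped = "\\{\\{(\\w+)\\}\\}";  // the compiler turns each \\ into one backslash
const std::string raw     = R"(\{\{(\w+)\}\})";    // raw literal: no escape processing at all

// Hypothetical helper: returns the text inside the first {{...}} in `s`,
// or an empty string if there is none.
std::string first_placeholder(const std::string& s, const std::string& pattern) {
    std::smatch m;
    return std::regex_search(s, m, std::regex(pattern)) ? m.str(1) : std::string();
}
```

Since `escaped == raw` holds, either string can be passed as `pattern` and the result is the same.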
I'm not very good with regular expressions, so I would appreciate it if someone could tell me the right regular expression to capture the three elements in this format:
<element1>[<element2>="<element3>"]
I could use boost if needed. The delimiters in this string are '[', '=', ']', '"' and ' '.
Update: This is what I tried till now -
int main(void) {
std::string subject("foo[bar=\"baz\"]");
try {
std::regex re("([a-zA-Z]+)[([a-zA-Z])=");
std::sregex_iterator next(subject.begin(), subject.end(), re);
std::sregex_iterator end;
while (next != end) {
std::smatch match = *next;
std::cout << match.str() << std::endl;
next++;
}
} catch (std::regex_error& e) {
std::cout << "Error!" << std::endl;
}
}
Though this gives me -
foo[
bar
baz
Thanks
You don't need iterators for this, you can match it all in one expression with capture groups (<capture>) that return sub matches like this:
// Note: Raw string literal R"~()~" removes the need to escape the string
std::regex const e{R"~(([^[]+)\[([^=]+)="([^"]+)"\])~"};
//                     ^  1  ^  ^  2  ^  ^  3  ^
//                     |     |  |     |  |_____|------- sub_match #3
//                     |     |  |     |
//                     |     |  |_____|---------------- sub_match #2
//                     |     |
//                     |_____|------------------------- sub_match #1
std::string s(R"~(foo[bar="baz"])~"); // Raw string literal again
std::smatch m;
if(std::regex_match(s, m, e))
{
std::cout << m[1] << '\n'; // sub_match #1
std::cout << m[2] << '\n'; // sub_match #2
std::cout << m[3] << '\n'; // sub_match #3
}
You could use [<\[" ]?([^<>\[\]" =\x0a\x0d]+)[>\[" ]? to get the elements:
#include <string>
#include <sstream>
#include <vector>
#include <iterator>
#include <regex>
#include <iostream>
#include <iomanip>
auto input_text{
R"(foo[bar="baz"]
<element1>[<element2>="<element3>"])"};
auto fromString(std::string str) {
std::vector<std::string> elements;
std::regex r{R"([<\[" ]?([^<>\[\]" =\x0a\x0d]+)[>\[" ]?)"};
std::istringstream iss(str);
auto it = std::sregex_iterator(str.begin(), str.end(), r);
auto end = std::sregex_iterator();
for(; it != end; ++it) {
auto match = *it;
auto element = match[1].str();
elements.push_back(element);
}
return elements;
}
int main()
{
auto result = fromString(input_text);
for (auto t : result) {
std::cout << t << '\n';
}
return 0;
}
Output:
foo
bar
baz
element1
element2
element3
Note: I'm compiling with the C++14 flag. I am trying to create a very simple lexer in C++, using regular expressions to identify the different tokens. My program is able to identify the tokens and display them, but the output is of the form
int
main
hello
2
*
3
+
return
I want the output to be in the form
int IDENTIFIER
hello IDENTIFIER
* OPERATOR
3 NUMBER
and so on.
I am not able to achieve the above output.
Here is my program:
#include <iostream>
#include <string>
#include <regex>
#include <iterator>
#include <map>
using namespace std;
int main()
{
string str = " hello how are 2 * 3 you? 123 4567867*98";
// define list of token patterns
map<string, string> v
{
{"[0-9]+" , "NUMBERS"} ,
{"[a-z]+" , "IDENTIFIERS"},
{"\\*|\\+", "OPERATORS"}
};
// build the final regex
string reg = "";
for(auto it = v.begin(); it != v.end(); it++)
reg = reg + it->first + "|";
// remove extra trailing "|" from above instance of reg..
reg.pop_back();
cout << reg << endl;
regex re(reg);
auto words_begin = sregex_iterator(str.begin(), str.end(), re);
auto words_end = sregex_iterator();
for(sregex_iterator i = words_begin; i != words_end; i++)
{
smatch match = *i;
string match_str = match.str();
cout << match_str << "\t" << endl;
}
return 0;
}
What is the best way of doing this while also maintaining the order of the tokens as they appear in the source program?
I managed to do this with only one iteration over the parsed string. All you have to do is add parentheses around regex for each token type, then you'll be able to access the strings of these submatches. If you get a non-empty string for a submatch, that means it was matched. You know the index of the submatch and therefore the index in v.
#include <iostream>
#include <string>
#include <regex>
#include <iterator>
#include <vector>
int main()
{
std::string str = " hello how are 2 * 3 you? 123 4567867*98";
// use std::vector instead, we need to have it in this order
std::vector<std::pair<std::string, std::string>> v
{
{"[0-9]+" , "NUMBERS"} ,
{"[a-z]+" , "IDENTIFIERS"},
{"\\*|\\+", "OPERATORS"}
};
std::string reg;
for(auto const& x : v)
reg += "(" + x.first + ")|"; // parenthesize the submatches
reg.pop_back();
std::cout << reg << std::endl;
std::regex re(reg, std::regex::extended); // std::regex::extended for longest match
auto words_begin = std::sregex_iterator(str.begin(), str.end(), re);
auto words_end = std::sregex_iterator();
for(auto it = words_begin; it != words_end; ++it)
{
size_t index = 0;
for( ; index < it->size(); ++index)
if(!it->str(index + 1).empty()) // determine which submatch was matched
break;
std::cout << it->str() << "\t" << v[index].second << std::endl;
}
return 0;
}
std::regex re(reg, std::regex::extended); requests leftmost-longest matching, which a lexical analyzer needs. Otherwise it might identify while1213 as while followed by the number 1213, and the result would depend on the order in which you define the regexes.
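To see the difference between the two grammars in isolation, here is a minimal sketch. `first_match` is a helper name I invented, and the pattern `if|[a-z]+` is just an illustrative keyword-versus-identifier clash, not taken from the program above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first match of `pattern` in `text`
// under the given grammar, or an empty string if nothing matches.
std::string first_match(const std::string& text, const std::string& pattern,
                        std::regex::flag_type grammar) {
    std::regex re(pattern, grammar);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string();
}
```

On a standard-conforming library, `first_match("iffy", "if|[a-z]+", std::regex::ECMAScript)` should return `if`, because ECMAScript takes the first alternative that succeeds, while the same call with `std::regex::extended` should return `iffy`, because the POSIX grammars pick the longest match at the leftmost position.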
This is a quick and dirty solution iterating on each pattern, and for each pattern trying to match the entire string, then iterating over matches and storing each match with its position in a map. The map implicitly sorts the matches by key (position) for you, so then you just need to iterate the map to get the matches in positional order, regardless of their pattern name.
#include <iterator>
#include <iostream>
#include <string>
#include <regex>
#include <list>
#include <map>
using namespace std;
int main(){
string str = " hello how are 2 * 3 you? 123 4567867*98";
// define list of patterns
map<string,string> patterns {
{ "[0-9]+" , "NUMBERS" },
{ "[a-z]+" , "IDENTIFIERS" },
{ "\\*|\\+", "OPERATORS" }
};
// storage for results
map< size_t, pair<string,string> > matches;
for ( auto pat = patterns.begin(); pat != patterns.end(); ++pat )
{
regex r(pat->first);
auto words_begin = sregex_iterator( str.begin(), str.end(), r );
auto words_end = sregex_iterator();
for ( auto it = words_begin; it != words_end; ++it )
matches[ it->position() ] = make_pair( it->str(), pat->second );
}
for ( auto match = matches.begin(); match != matches.end(); ++match )
cout<< match->second.first << " " << match->second.second << endl;
}
Output:
hello IDENTIFIERS
how IDENTIFIERS
are IDENTIFIERS
2 NUMBERS
* OPERATORS
3 NUMBERS
you IDENTIFIERS
123 NUMBERS
4567867 NUMBERS
* OPERATORS
98 NUMBERS
I would like to split a string like this one
“this1245is#g$0,therhsuidthing345”
using a list of words like the one below
{“this”, “is”, “the”, “thing”}
into this list
{“this”, “1245”, “is”, “#g$0,”, “the”, “rhsuid”, “thing”, “345”}
// ^--------------^---------------^------------------^-- these were the delimiters
The delimiters are allowed to appear more than once in the string to split, and it can be done using regular expressions
The precedence is in the order in which the delimiters appear in the array
The platform I'm developing for has no support for the Boost library
Update
This is what I have for the moment
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("this1245is#g$0,therhsuidthing345");
std::string delimiters[] = {"this", "is", "the", "thing"};
for (int i=0; i<4; i++) {
std::string delimiter = "(" + delimiters[i] + ")(.*)";
std::regex e (delimiter); // matches words beginning by the i-th delimiter
// default constructor = end-of-sequence:
std::sregex_token_iterator rend;
std::cout << "1st and 2nd submatches:";
int submatches[] = { 1, 2 };
std::sregex_token_iterator c ( s.begin(), s.end(), e, submatches );
while (c!=rend) std::cout << " [" << *c++ << "]";
std::cout << std::endl;
}
return 0;
}
output:
1st and 2nd submatches:[this][x1245fisA#g$0,therhsuidthing345]
1st and 2nd submatches:[is][x1245fisA#g$0,therhsuidthing345]
1st and 2nd submatches:[the][rhsuidthing345]
1st and 2nd submatches:[thing][345]
I think I need to make some recursive thing to call on each iteration
Build the expression you want for matches only (re), then pass in {-1, 0} to your std::sregex_token_iterator to return all non-matches (-1) and matches (0).
#include <iostream>
#include <regex>
int main() {
std::string s("this1245is#g$0,therhsuidthing345");
std::regex re("(this|is|the|thing)");
std::sregex_token_iterator iter(s.begin(), s.end(), re, { -1, 0 });
std::sregex_token_iterator end;
while (iter != end) {
//Works in vc13, clang requires you increment separately,
//haven't gone into implementation to see if/how ssub_match is affected.
//Workaround: increment separately.
//std::cout << "[" << *iter++ << "] ";
std::cout << "[" << *iter << "] ";
++iter;
}
}
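The loop above also emits an empty token whenever two matches are adjacent or the string starts with a delimiter. Below is a sketch that wraps the same {-1, 0} technique in a function, builds the alternation from the delimiter list, and drops the empty pieces. `split_keep_delimiters` is a made-up name, and the code assumes the delimiters contain no regex metacharacters. As for precedence: with the default ECMAScript grammar the alternatives are tried left to right, so listing the delimiters in priority order may be all that is needed, if that is what the question means by precedence.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `s` on the given delimiters and keeps the
// delimiters in the output. Assumption: the delimiters contain no regex
// metacharacters, because they are pasted into the pattern unescaped.
std::vector<std::string> split_keep_delimiters(
        const std::string& s, const std::vector<std::string>& delimiters) {
    // Build "this|is|the|thing". ECMAScript tries alternatives left to right,
    // so earlier delimiters win when several could match at one position.
    std::string pattern;
    for (const auto& d : delimiters) {
        if (!pattern.empty()) pattern += '|';
        pattern += d;
    }
    std::vector<std::string> out;
    if (pattern.empty()) {  // no delimiters: nothing to split on
        if (!s.empty()) out.push_back(s);
        return out;
    }
    const std::regex re(pattern);
    // -1 selects the text between matches, 0 selects the match itself.
    std::sregex_token_iterator it(s.begin(), s.end(), re, {-1, 0}), end;
    for (; it != end; ++it) {
        if (it->length() > 0) out.push_back(it->str());  // skip empty gaps
    }
    return out;
}
```

For the string and word list from the question this should yield the eight pieces this, 1245, is, #g$0, (with its trailing comma), the, rhsuid, thing and 345.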
I don't know how to perform the precedence requirement. This seems to work on the given input:
std::vector<std::string> parse (std::string s)
{
std::vector<std::string> out;
std::regex re("(this|is|the|thing).*");
std::string word;
auto i = s.begin();
while (i != s.end()) {
std::match_results<std::string::iterator> m;
if (std::regex_match(i, s.end(), m, re)) {
if (!word.empty()) {
out.push_back(word);
word.clear();
}
out.push_back(std::string(m[1].first, m[1].second));
i += out.back().size();
} else {
word += *i++;
}
}
if (!word.empty()) {
out.push_back(word);
}
return out;
}
vector<string> strs;
boost::split(strs,line,boost::is_space());
I want to extract only those words within double quotes. So, if the content is:
Would "you" like to have responses to your "questions" sent to you via email?
The answer must be
1- you
2- questions
std::string str("test \"me too\" and \"I\" did it");
std::regex rgx("\"([^\"]*)\""); // group 1 will capture: me too
std::sregex_iterator current(str.begin(), str.end(), rgx);
std::sregex_iterator end;
for (; current != end; ++current)
    std::cout << (*current)[1] << '\n'; // print only the text between the quotes
If you really want to use Regex, you can do it like so:
#include <regex>
#include <sstream>
#include <vector>
#include <iostream>
int main() {
std::string str = R"d(Would "you" like to have responses to your "questions" sent to you via email?)d";
std::regex rgx(R"(\"(\w+)\")");
std::smatch match;
std::string buffer;
std::stringstream ss(str);
std::vector<std::string> strings;
//Split by whitespaces..
while(ss >> buffer)
strings.push_back(buffer);
for(auto& i : strings) {
if(std::regex_match(i,match, rgx)) {
std::ssub_match submatch = match[1];
std::cout << submatch.str() << '\n';
}
}
}
I think only MSVC and Clang supported <regex> properly at the time of writing; otherwise you can use Boost.Regex instead.
Use the split() function from this answer, then extract the odd-numbered items:
std::vector<std::string> itms = split("would \"you\" like \"questions\"?", '"');
// Step by index so the loop cannot run past the end of the vector.
for (std::size_t i = 1; i < itms.size(); i += 2) {
std::cout << itms[i] << std::endl;
}
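The split() helper referred to above is not shown here, so the following is only a stand-in sketch of what such a function might look like, built on std::getline. Both `split` as written and `quoted_words` are my own illustrative definitions, not the code from the linked answer.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the split() helper mentioned above:
// cuts `s` at every occurrence of `delim`.
std::vector<std::string> split(const std::string& s, char delim) {
    std::vector<std::string> items;
    std::istringstream in(s);
    std::string item;
    while (std::getline(in, item, delim)) items.push_back(item);
    return items;
}

// After splitting on '"', the quoted text sits at the odd indices 1, 3, 5, ...
// Assumption: the quotes are balanced; after an unmatched opening quote the
// rest of the string would be reported as if it were quoted.
std::vector<std::string> quoted_words(const std::string& s) {
    const std::vector<std::string> items = split(s, '"');
    std::vector<std::string> words;
    for (std::size_t i = 1; i < items.size(); i += 2) words.push_back(items[i]);
    return words;
}
```

For the sentence in the question, `quoted_words` should return you and questions.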