C++ regular expression to match a string - c++

I'm not very good with regular expressions, so I would appreciate it if someone could tell me the right regular expression to capture the three elements in this format:
<element1>[<element2>="<element3>"]
I could use boost if needed. The delimiters in this string are '[', '=', ']', '"' and ' '.
Update: This is what I tried till now -
int main(void) {
std::string subject("foo[bar=\"baz\"]");
try {
std::regex re("([a-zA-Z]+)[([a-zA-Z])=");
std::sregex_iterator next(subject.begin(), subject.end(), re);
std::sregex_iterator end;
while (next != end) {
std::smatch match = *next;
std::cout << match.str() << std::endl;
next++;
}
} catch (std::regex_error& e) {
std::cout << "Error!" << std::endl;
}
}
Though this gives me:
foo[
bar
baz
Thanks

You don't need iterators for this, you can match it all in one expression with capture groups (<capture>) that return sub matches like this:
// Note: Raw string literal R"~()~" removes the need to escape the string
std::regex const e{R"~(([^[]+)\[([^=]+)="([^"]+)"\])~"};
//                     ^  1  ^  ^  2  ^  ^  3  ^
//                     |     |  |     |  |_____|------- sub_match #3
//                     |     |  |     |
//                     |     |  |_____|---------------- sub_match #2
//                     |     |
//                     |_____|------------------------- sub_match #1
std::string s(R"~(foo[bar="baz"])~"); // Raw string literal again
std::smatch m;
if(std::regex_match(s, m, e))
{
std::cout << m[1] << '\n'; // sub_match #1
std::cout << m[2] << '\n'; // sub_match #2
std::cout << m[3] << '\n'; // sub_match #3
}

You could use [<\[" ]?([^<>\[\]" =\x0a\x0d]+)[>\[" ]? to get the elements:
#include <string>
#include <sstream>
#include <vector>
#include <iterator>
#include <regex>
#include <iostream>
#include <iomanip>
auto input_text{
R"(foo[bar="baz"]
<element1>[<element2>="<element3>"])"};
auto fromString(std::string str) {
std::vector<std::string> elements;
std::regex r{R"([<\[" ]?([^<>\[\]" =\x0a\x0d]+)[>\[" ]?)"};
std::istringstream iss(str);
auto it = std::sregex_iterator(str.begin(), str.end(), r);
auto end = std::sregex_iterator();
for(; it != end; ++it) {
auto match = *it;
auto element = match[1].str();
elements.push_back(element);
}
return elements;
}
int main()
{
auto result = fromString(input_text);
for (auto t : result) {
std::cout << t << '\n';
}
return 0;
}
Output:
foo
bar
baz
element1
element2
element3

Related

Parsing a string in c++ with a specific format

I have this string: post "ola tudo bem como esta" alghero.jpg, and I want to break it into 3 pieces: post, ola tudo bem como esta (I don't want the quotes), and alghero.jpg. I tried it in C because I'm new and not really good at programming in C++, but it's not working. Is there a more efficient way of doing this in C++?
Program:
int main()
{
char* token1 = new char[128];
char* token2 = new char[128];
char* token3 = new char[128];
char str[] = "post \"ola tudo bem como esta\" alghero.jpg";
char *token;
/* get the first token */
token = strtok(str, " ");
//walk through other tokens
while( token != NULL ) {
printf( " %s\n", token );
token = strtok(NULL, " ");
}
return(0);
}
In C++14 and later, you can use std::quoted to read quoted strings from any std::istream, such as std::istringstream, e.g.:
#include <iostream>
#include <sstream>
#include <string>
#include <iomanip>
int main()
{
std::string token1, token2, token3;
std::string str = "post \"ola tudo bem como esta\" alghero.jpg";
std::istringstream(str) >> token1 >> std::quoted(token2) >> token3;
std::cout << token1 << "\n";
std::cout << token2 << "\n";
std::cout << token3 << "\n";
return 0;
}
Use find to find the positions of the 2 quotes. Use substr to get the string from index 0 to first quote, first quote to second quote, and second quote to end.
std::string s = "post \"ola tudo bem como esta\" alghero.jpg";
auto first = s.find('\"');
if (first != s.npos) {
auto second = s.find('\"', first + 1);
if (second != s.npos) {
std::cout << s.substr(0, first-1) << '\n';
std::cout << s.substr(first+1, second-first-1) << '\n';
std::cout << s.substr(second+2) << '\n';
}
}
Output:
post
ola tudo bem como esta
alghero.jpg
One option for parsing strings is to use regular expressions, for example:
#include <iostream>
#include <regex>
#include <string>
// struct to hold return value of parse function
struct parse_result_t
{
bool parsed{ false };
std::string token1;
std::string token2;
std::string token3;
};
// the parse function
auto parse(const std::string& string)
{
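// the regex, piece by piece:
// ^ matches the start of the string
// (.*?) lazily captures token 1, up to the whitespace before the opening quote
// \s+ matches one or more whitespace characters
// \"(.*?)\" captures token 2 between two double quotes; in the C++ literal below the backslash and the quote are each escaped
// \s+(.*)$ skips whitespace, then captures token 3 up to the end of the string
// see it live here : https://regex101.com/r/XnkAZV/1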
static std::regex rx{ "^(.*?)\\s+\\\"(.*?)\\\"\\s+(.*)$" };
std::smatch match;
parse_result_t result;
if (std::regex_search(string, match, rx))
{
result.parsed = true;
result.token1 = match[1];
result.token2 = match[2];
result.token3 = match[3];
}
return result;
}
int main()
{
auto result = parse("post \"ola tudo bem como esta\" alghero.jpg");
std::cout << "parse result = " << (result.parsed ? "success" : "failed") << "\n";
std::cout << "token 1 = " << result.token1 << "\n";
std::cout << "token 2 = " << result.token2 << "\n";
std::cout << "token 3 = " << result.token3 << "\n";
return 0;
}
If the strings are always separated by a single space, you can just find the first space and last space using std::string::find and std::string::rfind, split on those characters, and unquote the middle string:
#include <iostream>
#include <tuple>
#include <string>
std::string unquote(const std::string& str) {
if (str.size() < 2 || str.front() != '"' || str.back() != '"') {
return str;
}
return str.substr(1, str.size() - 2);
}
std::tuple < std::string, std::string, std::string> parse_triple_with_quoted_middle(const std::string& str) {
auto iter1 = str.begin() + str.find(' ');
auto iter2 = str.begin() + str.rfind(' ');
auto str1 = std::string(str.begin(),iter1);
auto str2 = std::string(iter1 + 1, iter2);
auto str3 = std::string(iter2 + 1, str.end() );
return { str1, unquote(str2), str3 };
}
int main()
{
std::string test = "post \"ola tudo bem como esta\" alghero.jpg";
auto [str1, str2, str3] = parse_triple_with_quoted_middle(test);
std::cout << str1 << "\n";
std::cout << str2 << "\n";
std::cout << str3 << "\n";
}
You should probably put more input validation into the above, however.
You could use regular expressions for this:
The pattern to search for repeatedly is: optional leading whitespace, \s*; then ([^\"]*), zero or more characters other than quotes (zero or more because several quotes could appear one after the other), captured as a group (hence the parentheses); and finally either a quote \" or (|) the end of the expression $, in a group that we don't capture (?:). We use std::regex to store the pattern, wrapping it all within R"()" so that we can write the raw expression.
The while loop does a few things: it searches for the next match with regex_search, extracts the captured group, and updates the input line so that the next search starts where the current one finished. The loop also stops once line is empty, because the pattern can match an empty string and would otherwise keep matching forever. matches behaves like an array whose first element, matches[0], is the part of line matching the whole pattern; the following elements correspond to the pattern's captured groups.
#include <iostream> // cout
#include <regex> // regex_search, smatch
#include <string> // string
int main() {
    std::string line{"post \"ola tudo bem como esta\" alghero.jpg"};
    std::regex pattern{R"(\s*([^\"]*)(?:\"|$))"};
    std::smatch matches{};
    // stop on an empty line: the pattern also matches the empty string
    while (!line.empty() && std::regex_search(line, matches, pattern))
    {
        std::cout << matches[1] << "\n";
        line = matches.suffix().str();
    }
}

Multiple substrings in between the same delimiters

I am new to c++ and would like to know how to extract multiple substrings, from a single string, in between the same delimiters?
ex.
"{("id":"4219","firstname":"Paul"),("id":"4349","firstname":"Joe"),("id":"4829","firstname":"Brandy")}"
I want the ids:
4219 , 4349 , 4829
You can use regex to match the ids:
#include <iostream>
#include <regex>
int main() {
// This is your string.
std::string s{ R"({("id":"4219","firstname":"Paul"),("id":"4349","firstname":"Joe"),("id":"4829","firstname":"Brandy")})"};
// Matches "id":"<any number of digits>"
// The id will be captured in the first group
std::regex r(R"("id"\s*:\s*"(\d+))");
// Make iterators that perform the matching
auto ids_begin = std::sregex_iterator(s.begin(), s.end(), r);
auto ids_end = std::sregex_iterator();
// Iterate the matches and print the first group of each of them
// (where the id is captured)
for (auto it = ids_begin; it != ids_end; ++it) {
std::smatch match = *it;
std::cout << match[1].str() << ',';
}
}
Well, here is the q&d hack:
#include <iostream>
#include <sstream>
#include <string>
int main()
{
std::string s{ "{(\"id\":\"4219\",\"firstname\":\"Paul\"),"
"(\"id\":\"4349\",\"firstname\":\"Joe\"),"
"(\"id\":\"4829\",\"firstname\":\"Brandy\")}"
};
std::string id{ "\"id\":\"" };
for (auto f = s.find("\"id\":\""); f != s.npos; f = s.find(id, f)) {
std::istringstream iss{ std::string{ s.begin() + (f += id.length()), s.end() } };
int id; iss >> id;
std::cout << id << '\n';
}
}
Reliable? Well, just hope nobody names children "id":" ...

regex_iterator not matching groups in regular expression

How can I extract Test and Again from the string s in the code below?
Currently I am using regex_iterator, but it doesn't seem to be matching the groups in the regular expression: I am getting {{Test}} and {{Again}} in the output.
#include <regex>
#include <iostream>
int main()
{
const std::string s = "<abc>{{Test}}</abc><def>{{Again}}</def>";
std::regex rgx("\\{\\{(\\w+)\\}\\}");
std::smatch match;
std::sregex_iterator next(s.begin(), s.end(), rgx);
std::sregex_iterator end;
while (next != end) {
std::smatch match = *next;
std::cout << match.str() << "\n";
next++;
}
return 0;
}
I also tried using regex_search, but it does not work with multiple occurrences and only gives Test as output:
#include <regex>
#include <iostream>
int main()
{
const std::string s = "<abc>{{Test}}</abc><def>{{Again}}</def>";
std::regex rgx("\\{\\{(\\w+)\\}\\}");
std::smatch match;
if (std::regex_search(s, match, rgx,std::regex_constants::match_any))
{
std::cout<<"Match size is "<<match.size()<<std::endl;
for(auto elem:match)
std::cout << "match: " << elem << '\n';
}
}
Also, as a side note, why are two backslashes needed to escape { or }?
To access the contents of the capturing group you need to use .str(1):
std::cout << match.str(1) << std::endl;
See the C++ demo:
#include <regex>
#include <iostream>
int main()
{
const std::string s = "<abc>{{Test}}</abc><def>{{Again}}</def>";
// std::regex rgx("\\{\\{(\\w+)\\}\\}");
// Better, use a raw string literal:
std::regex rgx(R"(\{\{(\w+)\}\})");
std::smatch match;
std::sregex_iterator next(s.begin(), s.end(), rgx);
std::sregex_iterator end;
while (next != end) {
std::smatch match = *next;
std::cout << match.str(1) << std::endl;
next++;
}
return 0;
}
Output:
Test
Again
Note you do not have to use double backslashes to define a regex escape sequence inside raw string literals (here, R"(pattern_here)").

Writing a very simple lexical analyser in C++

NOTE: I'm compiling with the C++14 flag. I am trying to create a very simple lexer in C++. I am using regular expressions to identify different tokens. My program is able to identify the tokens and display them, but the output is of the form
int
main
hello
2
*
3
+
return
I want the output to be in the form
int IDENTIFIER
hello IDENTIFIER
* OPERATOR
3 NUMBER
and so on.
I am not able to achieve the above output.
Here is my program:
#include <iostream>
#include <string>
#include <regex>
#include <iterator>
#include <map>
using namespace std;
int main()
{
string str = " hello how are 2 * 3 you? 123 4567867*98";
// define list of token patterns
map<string, string> v
{
{"[0-9]+" , "NUMBERS"} ,
{"[a-z]+" , "IDENTIFIERS"},
{"\\*|\\+", "OPERATORS"}
};
// build the final regex
string reg = "";
for(auto it = v.begin(); it != v.end(); it++)
reg = reg + it->first + "|";
// remove extra trailing "|" from above instance of reg..
reg.pop_back();
cout << reg << endl;
regex re(reg);
auto words_begin = sregex_iterator(str.begin(), str.end(), re);
auto words_end = sregex_iterator();
for(sregex_iterator i = words_begin; i != words_end; i++)
{
smatch match = *i;
string match_str = match.str();
cout << match_str << "\t" << endl;
}
return 0;
}
What is the best way of doing this while also maintaining the order of the tokens as they appear in the source program?
I managed to do this with only one iteration over the parsed string. All you have to do is add parentheses around regex for each token type, then you'll be able to access the strings of these submatches. If you get a non-empty string for a submatch, that means it was matched. You know the index of the submatch and therefore the index in v.
#include <iostream>
#include <string>
#include <regex>
#include <iterator>
#include <vector>
int main()
{
std::string str = " hello how are 2 * 3 you? 123 4567867*98";
// use std::vector instead, we need to have it in this order
std::vector<std::pair<std::string, std::string>> v
{
{"[0-9]+" , "NUMBERS"} ,
{"[a-z]+" , "IDENTIFIERS"},
{"\\*|\\+", "OPERATORS"}
};
std::string reg;
for(auto const& x : v)
reg += "(" + x.first + ")|"; // parenthesize the submatches
reg.pop_back();
std::cout << reg << std::endl;
std::regex re(reg, std::regex::extended); // std::regex::extended for longest match
auto words_begin = std::sregex_iterator(str.begin(), str.end(), re);
auto words_end = std::sregex_iterator();
for(auto it = words_begin; it != words_end; ++it)
{
size_t index = 0;
for( ; index < it->size(); ++index)
if(!it->str(index + 1).empty()) // determine which submatch was matched
break;
std::cout << it->str() << "\t" << v[index].second << std::endl;
}
return 0;
}
Passing std::regex::extended in std::regex re(reg, std::regex::extended); selects the POSIX extended grammar, which prefers the longest match; a lexical analyzer needs that. Otherwise it might identify while1213 as while followed by the number 1213, and the result would depend on the order in which you define the patterns.
This is a quick and dirty solution iterating on each pattern, and for each pattern trying to match the entire string, then iterating over matches and storing each match with its position in a map. The map implicitly sorts the matches by key (position) for you, so then you just need to iterate the map to get the matches in positional order, regardless of their pattern name.
#include <iterator>
#include <iostream>
#include <string>
#include <regex>
#include <list>
#include <map>
using namespace std;
int main(){
string str = " hello how are 2 * 3 you? 123 4567867*98";
// define list of patterns
map<string,string> patterns {
{ "[0-9]+" , "NUMBERS" },
{ "[a-z]+" , "IDENTIFIERS" },
{ "\\*|\\+", "OPERATORS" }
};
// storage for results
map< size_t, pair<string,string> > matches;
for ( auto pat = patterns.begin(); pat != patterns.end(); ++pat )
{
regex r(pat->first);
auto words_begin = sregex_iterator( str.begin(), str.end(), r );
auto words_end = sregex_iterator();
for ( auto it = words_begin; it != words_end; ++it )
matches[ it->position() ] = make_pair( it->str(), pat->second );
}
for ( auto match = matches.begin(); match != matches.end(); ++match )
cout<< match->second.first << " " << match->second.second << endl;
}
Output:
hello IDENTIFIERS
how IDENTIFIERS
are IDENTIFIERS
2 NUMBERS
* OPERATORS
3 NUMBERS
you IDENTIFIERS
123 NUMBERS
4567867 NUMBERS
* OPERATORS
98 NUMBERS

C++ split string using a list of words as separators

I would like to split a string like this one
“this1245is#g$0,therhsuidthing345”
using a list of words like the one below
{“this”, “is”, “the”, “thing”}
into this list
{“this”, “1245”, “is”, “#g$0,”, “the”, “rhsuid”, “thing”, “345”}
// ^--------------^---------------^------------------^-- these were the delimiters
The delimiters are allowed to appear more than once in the string to split, and it can be done using regular expressions
The precedence is in the order in which the delimiters appear in the array
The platform I'm developing for has no support for the Boost library
Update
This is what I have for the moment
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("this1245is#g$0,therhsuidthing345");
std::string delimiters[] = {"this", "is", "the", "thing"};
for (int i=0; i<4; i++) {
std::string delimiter = "(" + delimiters[i] + ")(.*)";
std::regex e (delimiter); // matches words beginning by the i-th delimiter
// default constructor = end-of-sequence:
std::sregex_token_iterator rend;
std::cout << "1st and 2nd submatches:";
int submatches[] = { 1, 2 };
std::sregex_token_iterator c ( s.begin(), s.end(), e, submatches );
while (c!=rend) std::cout << " [" << *c++ << "]";
std::cout << std::endl;
}
return 0;
}
output:
1st and 2nd submatches:[this][1245is#g$0,therhsuidthing345]
1st and 2nd submatches:[is][1245is#g$0,therhsuidthing345]
1st and 2nd submatches:[the][rhsuidthing345]
1st and 2nd submatches:[thing][345]
I think I need to make some recursive thing to call on each iteration
Build the expression you want for matches only (re), then pass in {-1, 0} to your std::sregex_token_iterator to return all non-matches (-1) and matches (0).
#include <iostream>
#include <regex>
int main() {
std::string s("this1245is#g$0,therhsuidthing345");
std::regex re("(this|is|the|thing)");
std::sregex_token_iterator iter(s.begin(), s.end(), re, { -1, 0 });
std::sregex_token_iterator end;
while (iter != end) {
//Works in vc13, clang requires you increment separately,
//haven't gone into implementation to see if/how ssub_match is affected.
//Workaround: increment separately.
//std::cout << "[" << *iter++ << "] ";
std::cout << "[" << *iter << "] ";
++iter;
}
}
I don't know how to perform the precedence requirement. This seems to work on the given input:
std::vector<std::string> parse (std::string s)
{
std::vector<std::string> out;
std::regex re("(this|is|the|thing).*");
std::string word;
auto i = s.begin();
while (i != s.end()) {
std::match_results<std::string::iterator> m;
if (std::regex_match(i, s.end(), m, re)) {
if (!word.empty()) {
out.push_back(word);
word.clear();
}
out.push_back(std::string(m[1].first, m[1].second));
i += out.back().size();
} else {
word += *i++;
}
}
if (!word.empty()) {
out.push_back(word);
}
return out;
}
vector<string> strs;
boost::split(strs,line,boost::is_space());