Writing a very simple lexical analyser in C++ - c++

NOTE : I'm using C++14 flag to compile... I am trying to create a very simple lexer in C++. I am using regular expressions to identify different tokens . My program is able to identify the tokens and display them. BUT THE out is of the form
int
main
hello
2
*
3
+
return
I want the output to be in the form
int IDENTIFIER
hello IDENTIFIER
* OPERATOR
3 NUMBER
so on...........
I am not able to achieve the above output.
Here is my program:
#include <iostream>
#include <string>
#include <regex>
#include <iterator>
#include <map>
using namespace std;
int main()
{
string str = " hello how are 2 * 3 you? 123 4567867*98";
// define list of token patterns
map<string, string> v
{
{"[0-9]+" , "NUMBERS"} ,
{"[a-z]+" , "IDENTIFIERS"},
{"[\\*|\\+", "OPERATORS"}
};
// build the final regex
string reg = "";
for(auto it = v.begin(); it != v.end(); it++)
reg = reg + it->first + "|";
// remove extra trailing "|" from above instance of reg..
reg.pop_back();
cout << reg << endl;
regex re(reg);
auto words_begin = sregex_iterator(str.begin(), str.end(), re);
auto words_end = sregex_iterator();
for(sregex_iterator i = words_begin; i != words_end; i++)
{
smatch match = *i;
string match_str = match.str();
cout << match_str << "\t" << endl;
}
return 0;
}
what is the most optimal way of doing it and also maintain the order of tokens as they appear in the source program?

I managed to do this with only one iteration over the parsed string. All you have to do is add parentheses around regex for each token type, then you'll be able to access the strings of these submatches. If you get a non-empty string for a submatch, that means it was matched. You know the index of the submatch and therefore the index in v.
#include <iostream>
#include <string>
#include <regex>
#include <iterator>
#include <vector>
int main()
{
std::string str = " hello how are 2 * 3 you? 123 4567867*98";
// use std::vector instead, we need to have it in this order
std::vector<std::pair<std::string, std::string>> v
{
{"[0-9]+" , "NUMBERS"} ,
{"[a-z]+" , "IDENTIFIERS"},
{"\\*|\\+", "OPERATORS"}
};
std::string reg;
for(auto const& x : v)
reg += "(" + x.first + ")|"; // parenthesize the submatches
reg.pop_back();
std::cout << reg << std::endl;
std::regex re(reg, std::regex::extended); // std::regex::extended for longest match
auto words_begin = std::sregex_iterator(str.begin(), str.end(), re);
auto words_end = std::sregex_iterator();
for(auto it = words_begin; it != words_end; ++it)
{
size_t index = 0;
for( ; index < it->size(); ++index)
if(!it->str(index + 1).empty()) // determine which submatch was matched
break;
std::cout << it->str() << "\t" << v[index].second << std::endl;
}
return 0;
}
std::regex re(reg, std::regex::extended); is for matching for the longest string which is necessary for a lexical analyzer. Otherwise it might identify while1213 as while and number 1213 and depends on the order you define for the regex.

This is a quick and dirty solution iterating on each pattern, and for each pattern trying to match the entire string, then iterating over matches and storing each match with its position in a map. The map implicitly sorts the matches by key (position) for you, so then you just need to iterate the map to get the matches in positional order, regardless of their pattern name.
#include <iterator>
#include <iostream>
#include <string>
#include <regex>
#include <list>
#include <map>
using namespace std;
int main(){
string str = " hello how are 2 * 3 you? 123 4567867*98";
// define list of patterns
map<string,string> patterns {
{ "[0-9]+" , "NUMBERS" },
{ "[a-z]+" , "IDENTIFIERS" },
{ "\\*|\\+", "OPERATORS" }
};
// storage for results
map< size_t, pair<string,string> > matches;
for ( auto pat = patterns.begin(); pat != patterns.end(); ++pat )
{
regex r(pat->first);
auto words_begin = sregex_iterator( str.begin(), str.end(), r );
auto words_end = sregex_iterator();
for ( auto it = words_begin; it != words_end; ++it )
matches[ it->position() ] = make_pair( it->str(), pat->second );
}
for ( auto match = matches.begin(); match != matches.end(); ++match )
cout<< match->second.first << " " << match->second.second << endl;
}
Output:
hello IDENTIFIERS
how IDENTIFIERS
are IDENTIFIERS
2 NUMBERS
* OPERATORS
3 NUMBERS
you IDENTIFIERS
123 NUMBERS
4567867 NUMBERS
* OPERATORS
98 NUMBERS

Related

Regex search overlapping matches c++11

What regex expression should I use to search all occurrences that match:
Start with 55 or 66
followed by a minimum 8 characters in the range of [0-9a-fA-F] (HEX numbers)
Ends with \r (a carriage return)
Example string: 0205065509085503400066/r09\r
My expected result:
5509085503400066\r
5503400066\r
My current result:
5509085503400066\r
Using
(?:55|66)[0-9a-fA-F]{8,}\r
As you can sie, this finds onlny the first result but not the second one.
Edit clarification
I search the string using Regex. It'll select the message for further parsing. The target string can start anywhere in the string. The target string is only valid if it only contains base-16 (HEX) numbers, and ends with a carriage return.
[start] [information part minimum 8 chars] [end symbol-carigge return]
I'm using the std::regex library in c++11 with the flag ECMAScript
Edit
I have created an alternative solution that gives me the expected result. But this is not pure regex.
#include <iostream>
#include <string>
#include <regex>
int main()
{
// repeated search (see also
std::regex_iterator)
std::string log("0055\r0655036608090705\r");
std::regex r("(?:55|66)[0-9a-fA-F]{8,}\r");
std::smatch sm;
while(regex_search(log, sm, r))
{
std::cout << sm.str() << '\n';
log = sm.str();
log += sm.suffix();
log[0] = 'a' ;
}
}
** Edit: Working regex solution based on comments **
#include <iostream>
#include <string>
#include <regex>
int main()
{
// repeated search (see also
std::regex_iterator)
std::string s("0055\r06550003665508090705\r0970");
std::regex r("(?=((?:55|66)[0-9a-fA-F]{8,}\r))");
auto words_begin =
std::sregex_iterator(s.begin(), s.end(), r);
auto words_end = std::sregex_iterator();
std::cout << "Found "
<< std::distance(words_begin, words_end)
<< " words:\n";
for (std::sregex_iterator i = words_begin; i != words_end; ++i) {
std::smatch match = *i;
std::string match_str = s.substr(match.position(1), match.length(1) - 1); //-1 cr
std::cout << match_str << " =" << match.position(1) << '\n';
}
}
Your are actually looking for overlapping matches. This can be achieved using a regex lookahead like this:
(?=((?:55|66)[0-9a-fA-F]{8,}\/r))
You will find the matches in question in group 1. The full-match, however, is empty.
Regex Demo (using /r instead of a carriage return for demonstration purposes only)
Sample Code:
#include <iostream>
#include <string>
#include <regex>
using namespace std;
int main() {
std::string subject("0055\r06550003665508090705\r0970");
try {
std::regex re("(?=((?:55|66)[0-9a-fA-F]{8,}\r))");
std::sregex_iterator next(subject.begin(), subject.end(), re);
std::sregex_iterator end;
while (next != end) {
std::smatch match = *next;
std::cout << match.str(1) << "\n";
next++;
}
} catch (std::regex_error& e) {
// Syntax error in the regular expression
}
return 0;
}
See also: Regex-Info: C++ Regular Expressions with std::regex

Multiple substrings in between the same delimiters

I am new to c++ and would like to know how to extract multiple substrings, from a single string, in between the same delimiters?
ex.
"{("id":"4219","firstname":"Paul"),("id":"4349","firstname":"Joe"),("id":"4829","firstname":"Brandy")}"
I want the ids:
4219 , 4349 , 4829
You can use regex to match the ids:
#include <iostream>
#include <regex>
int main() {
// This is your string.
std::string s{ R"({("id":"4219","firstname":"Paul"),("id":"4349","firstname":"Joe"),"("id":"4829","firstname":"Brandy")})"};
// Matches "id":"<any number of digits>"
// The id will be captured in the first group
std::regex r(R"("id"\s*:\s*"(\d+))");
// Make iterators that perform the matching
auto ids_begin = std::sregex_iterator(s.begin(), s.end(), r);
auto ids_end = std::sregex_iterator();
// Iterate the matches and print the first group of each of them
// (where the id is captured)
for (auto it = ids_begin; it != ids_end; ++it) {
std::smatch match = *it;
std::cout << match[1].str() << ',';
}
}
See it live on Coliru
Well, here is the q&d hack:
#include <iostream>
#include <sstream>
#include <string>
int main()
{
std::string s{ "{(\"id\":\"4219\",\"firstname\":\"Paul\"),"
"(\"id\":\"4349\",\"firstname\":\"Joe\"),"
"(\"id\":\"4829\",\"firstname\":\"Brandy\")}"
};
std::string id{ "\"id\":\"" };
for (auto f = s.find("\"id\":\""); f != s.npos; f = s.find(id, f)) {
std::istringstream iss{ std::string{ s.begin() + (f += id.length()), s.end() } };
int id; iss >> id;
std::cout << id << '\n';
}
}
Reliable? Well, just hope nobody names children "id":" ...

C++ substring contained between 2 specific characters

In c++ would like to extract all substrings in a string contained between specific characters, as example:
std::string str = "XPOINT:D#{MON 3};S#{1}"
std::vector<std:string> subsplit = my_needed_magic_function(str,"{}");
std::vector<int>::iterator it = subsplit.begin();
for(;it!=subsplit.end(),it++) std::cout<<*it<<endl;
result of this call should be:
MON 3
1
also using boost if needed.
You could try Regex:
#include <iostream>
#include <iterator>
#include <string>
#include <regex>
int main()
{
std::string s = "XPOINT:D#{MON 3};S#{1}.";
std::regex word_regex(R"(\{(.*?)\})");
auto first = std::sregex_iterator(s.begin(), s.end(), word_regex),
last = std::sregex_iterator();;
while (first != last)
std::cout << first++->str() << ' ';
}
Prints
{MON 3} {1}
Demo.

C++ split string using a list of words as separators

I would like to split a string like this one
“this1245is#g$0,therhsuidthing345”
using a list of words like the one bellow
{“this”, “is”, “the”, “thing”}
into this list
{“this”, “1245”, “is”, “#g$0,”, “the”, “rhsuid”, “thing”, “345”}
// ^--------------^---------------^------------------^-- these were the delimiters
The delimiters are allowed to appear more than once in the string to split, and it can be done using regular expressions
The precedence is in the order in which the delimiters appear in the array
The platform I'm developing for has no support for the Boost library
Update
This is what I have for the moment
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("this1245is#g$0,therhsuidthing345");
std::string delimiters[] = {"this", "is", "the", "thing"};
for (int i=0; i<4; i++) {
std::string delimiter = "(" + delimiters[i] + ")(.*)";
std::regex e (delimiter); // matches words beginning by the i-th delimiter
// default constructor = end-of-sequence:
std::sregex_token_iterator rend;
std::cout << "1st and 2nd submatches:";
int submatches[] = { 1, 2 };
std::sregex_token_iterator c ( s.begin(), s.end(), e, submatches );
while (c!=rend) std::cout << " [" << *c++ << "]";
std::cout << std::endl;
}
return 0;
}
output:
1st and 2nd submatches:[this][x1245fisA#g$0,therhsuidthing345]
1st and 2nd submatches:[is][x1245fisA#g$0,therhsuidthing345]
1st and 2nd submatches:[the][rhsuidthing345]
1st and 2nd submatches:[thing][345]
I think I need to make some recursive thing to call on each iteration
Build the expression you want for matches only (re), then pass in {-1, 0} to your std::sregex_token_iterator to return all non-matches (-1) and matches (0).
#include <iostream>
#include <regex>
int main() {
std::string s("this1245is#g$0,therhsuidthing345");
std::regex re("(this|is|the|thing)");
std::sregex_token_iterator iter(s.begin(), s.end(), re, { -1, 0 });
std::sregex_token_iterator end;
while (iter != end) {
//Works in vc13, clang requires you increment separately,
//haven't gone into implementation to see if/how ssub_match is affected.
//Workaround: increment separately.
//std::cout << "[" << *iter++ << "] ";
std::cout << "[" << *iter << "] ";
++iter;
}
}
I don't know how to perform the precedence requirement. This seems to work on the given input:
std::vector<std::string> parse (std::string s)
{
std::vector<std::string> out;
std::regex re("\(this|is|the|thing).*");
std::string word;
auto i = s.begin();
while (i != s.end()) {
std::match_results<std::string::iterator> m;
if (std::regex_match(i, s.end(), m, re)) {
if (!word.empty()) {
out.push_back(word);
word.clear();
}
out.push_back(std::string(m[1].first, m[1].second));
i += out.back().size();
} else {
word += *i++;
}
}
if (!word.empty()) {
out.push_back(word);
}
return out;
}
vector<string> strs;
boost::split(strs,line,boost::is_space());

how to iterate all regex matches in a std::string with their starting positions in c++11 std::regex?

I know two ways of getting regex matches from std::string, but I don't know how to get all matches with their respective offsets.
#include <string>
#include <iostream>
#include <regex>
int main() {
using namespace std;
string s = "123 apples 456 oranges 789 bananas oranges bananas";
regex r = regex("[a-z]+");
const sregex_token_iterator end;
// here I know how to get all occurences
// but don't know how to get starting offset of each one
for (sregex_token_iterator i(s.cbegin(), s.cend(), r); i != end; ++i) {
cout << *i << endl;
}
cout << "====" << endl;
// and here I know how to get position
// but the code is finding only first match
smatch m;
regex_search ( s, m, r );
for (unsigned i=0; i< m.size(); ++i) {
cout << m.position(i) << endl;
cout << m[i] << endl;
}
}
First of all, why token iterator? You don't have any marked subexpressions to iterate over.
Second, position() is a member function of a match, so:
#include <string>
#include <iostream>
#include <regex>
int main() {
std::string s = "123 apples 456 oranges 789 bananas oranges bananas";
std::regex r("[a-z]+");
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i )
{
std::smatch m = *i;
std::cout << m.str() << " at position " << m.position() << '\n';
}
}
live at coliru: http://coliru.stacked-crooked.com/a/492643ca2b6c5dac