Flex lexer output modification - c++

How can I use the flex lexer in C++ and modify a token's yytext value?
Let's say I have a rule like this:
"/*" {
int c; // yyinput() returns an int; a char cannot represent EOF reliably
while(true)
{
c = yyinput();
if(c == '\n')
++mylineno;
if (c==EOF){
yyerror( "EOF occurred while processing comment" );
break;
}
else if(c == '*')
{
if((c = yyinput()) == '/'){
return(tokens::COMMENT);}
else
unput(c);
}
}
}
And I want to get the token tokens::COMMENT with the comment text between /* and */ as its value.
(The above solution gives "/*" as the value.)
Additionally, tracking the line number is very important, so I'm looking for a solution that supports it.
EDIT
Of course I can modify the yytext and yyleng values (like yytext+=1; yyleng-=1), but that still doesn't solve the above problem.

I still think start conditions are the right answer.
%x C_COMMENT
char *str = NULL;
void addToString(char *data)
{
if(!str)
{
str = strdup(data);
}
else
{
/* handle string concatenation */
}
}
"/*" { BEGIN(C_COMMENT); }
<C_COMMENT>([^*\n\r]|(\*+([^*/\n\r])))* { addToString(yytext); }
<C_COMMENT>[\n\r] { /* handle tracking, add to string if desired */ }
<C_COMMENT>"*/" { BEGIN(INITIAL); }
I used the following as references:
http://ostermiller.org/findcomment.html
https://stackoverflow.com/a/2130124/1003855
You should be able to use a similar regular expression to handle strings.
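As a rough illustration of what the start-condition rules accomplish (collecting the text between the delimiters while counting newlines), here is a plain C++ sketch of the same logic outside flex. The names scanComment and CommentScan are hypothetical and not part of flex; in an actual scanner the accumulated string would be handed over when the closing */ rule fires.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: what a COMMENT token would carry.
struct CommentScan {
    std::string body;     // text between "/*" and "*/"
    std::size_t end = 0;  // index just past the closing "*/"
    int newlines = 0;     // line breaks seen inside the comment
};

// Scans a C-style comment that starts at src[pos].
// Returns false if no comment starts there or if it is unterminated.
bool scanComment(const std::string& src, std::size_t pos, CommentScan& out) {
    if (pos + 1 >= src.size() || src[pos] != '/' || src[pos + 1] != '*')
        return false;
    CommentScan scan;
    for (std::size_t i = pos + 2; i < src.size(); ++i) {
        if (src[i] == '*' && i + 1 < src.size() && src[i + 1] == '/') {
            scan.end = i + 2;
            out = scan;
            return true;
        }
        if (src[i] == '\n') ++scan.newlines;  // keeps line tracking accurate
        scan.body += src[i];
    }
    return false;  // reached end of input before "*/"
}
```

The caller would add `newlines` to its own line counter (mylineno in the question) and use `body` as the token value.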

Related

How to tokenize special characters depending on whitespace (< > | & etc.)

I found a project done a few years ago that does some simple command line parsing. While I really like its functionality, it does not support parsing special characters such as <, >, &, etc. I attempted to add this by adding some of the same conditions that the existing code uses to look for whitespace, escape characters, and quotes:
bool _isQuote(char c) {
if (c == '\"')
return true;
else if (c == '\'')
return true;
return false;
}
bool _isEscape(char c) {
if (c == '\\')
return true;
return false;
}
bool _isWhitespace(char c) {
if (c == ' ')
return true;
else if(c == '\t')
return true;
return false;
}
.
.
.
What I added:
bool _isLeftCarrot(char c) {
if (c == '<')
return true;
return false;
}
bool _isRightCarrot(char c) {
if (c == '>')
return true;
return false;
}
and so on for the rest of the special characters.
I also tried the same approach as the existing code in the parse method:
std::list<std::string> parse(const std::string& args) {
std::stringstream ain(args); // iterates over the input string
ain >> std::noskipws; // ensures not to skip whitespace
std::list<std::string> oargs; // list of strings where we will store the tokens
std::stringstream currentArg("");
currentArg >> std::noskipws;
// current state
enum State {
InArg, // scanning the string currently
InArgQuote, // scanning the string that started with a quote currently
OutOfArg // not scanning the string currently
};
State currentState = OutOfArg;
char currentQuoteChar = '\0'; // used to differentiate between ' and "
// ex. "sample'text"
char c;
std::stringstream ss;
std::string s;
// iterate character by character through input string
while(!ain.eof() && (ain >> c)) {
// if current character is a quote
if(_isQuote(c)) {
switch(currentState) {
case OutOfArg:
currentArg.str(std::string());
case InArg:
currentState = InArgQuote;
currentQuoteChar = c;
break;
case InArgQuote:
if (c == currentQuoteChar)
currentState = InArg;
else
currentArg << c;
break;
}
}
// if current character is whitespace
else if (_isWhitespace(c)) {
switch(currentState) {
case InArg:
oargs.push_back(currentArg.str());
currentState = OutOfArg;
break;
case InArgQuote:
currentArg << c;
break;
case OutOfArg:
// nothing
break;
}
}
// if current character is escape character
else if (_isEscape(c)) {
switch(currentState) {
case OutOfArg:
currentArg.str(std::string());
currentState = InArg;
case InArg:
case InArgQuote:
if (ain.eof())
{
currentArg << c;
throw(std::runtime_error("Found Escape Character at end of file."));
}
else {
char c1 = c;
ain >> c;
if (c != '\"')
currentArg << c1;
ain.unget();
ain >> c;
currentArg << c;
}
break;
}
}
What I added in the parse method:
// if current character is left carrot (<)
else if(_isLeftCarrot(c)) {
// convert from char to string and push onto list
ss << c;
ss >> s;
oargs.push_back(s);
}
// if current character is right carrot (>)
else if(_isRightCarrot(c)) {
ss << c;
ss >> s;
oargs.push_back(s);
}
.
.
.
else {
switch(currentState) {
case InArg:
case InArgQuote:
currentArg << c;
break;
case OutOfArg:
currentArg.str(std::string());
currentArg << c;
currentState = InArg;
break;
}
}
}
if (currentState == InArg) {
oargs.push_back(currentArg.str());
s.clear();
}
else if (currentState == InArgQuote)
throw(std::runtime_error("Starting quote has no ending quote."));
return oargs;
}
parse will return a list of strings of the tokens.
However, I am running into issues with a specific test case when the special character is attached to the end of the input. For example, the input
foo-bar&
will return this list: [{&},{foo-bar}] instead of what I want: [{foo-bar},{&}]
I'm struggling to fix this issue. I am new to C++, so any advice along with some explanation would be a great help.
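When you handle one of your special characters, you need to do the same sort of thing the original code does when it encounters whitespace: look at currentState, save the current argument if you are in the middle of one, and reset the state, since you are no longer in an argument. In your example, & is pushed onto the list immediately while foo-bar is still held in currentArg and is only pushed after the loop ends, which is why the order comes out reversed.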
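A minimal sketch of that idea, deliberately leaving out the quote and escape handling of the original parser: the pending argument is flushed before the special character is pushed. The names splitArgs, isSpecial, and isBlank are made up for this illustration, and the set of special characters is an assumption.

```cpp
#include <list>
#include <string>

// Assumed set of special characters; extend as needed.
bool isSpecial(char c) { return c == '<' || c == '>' || c == '|' || c == '&'; }
bool isBlank(char c) { return c == ' ' || c == '\t'; }

std::list<std::string> splitArgs(const std::string& input) {
    std::list<std::string> out;
    std::string current;  // the argument being built, if any

    // Saves the pending argument (if there is one) and resets it.
    auto flush = [&]() {
        if (!current.empty()) {
            out.push_back(current);
            current.clear();
        }
    };

    for (char c : input) {
        if (isBlank(c)) {
            flush();
        } else if (isSpecial(c)) {
            flush();                            // first the argument before it
            out.push_back(std::string(1, c));   // then the character itself
        } else {
            current += c;
        }
    }
    flush();  // whatever is still pending at end of input
    return out;
}
```

With this ordering, the input foo-bar& yields foo-bar followed by &. Building the one-character string with std::string(1, c) also avoids reusing a stringstream that has already hit end-of-stream.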

Retrieve each token from a file according to specific criteria

I'm trying to create a lexer for a functional language. One of its methods should return the next token of a file on each call.
For example :
func main() {
var MyVar : integer = 3+2;
}
So every time the next method is called, I would like it to return the next token in that sequence; in this case, it would look like this:
func
main
(
)
{
var
MyVar
:
integer
=
3
+
2
;
}
Except that the result I get is not what I expected:
func
main(
)
{
var
MyVar
:
integer
=
3+
2
}
Here is my method:
token_t Lexer::next() {
token_t ret;
std::string token_tmp;
bool IsSimpleQuote = false; // inside a char literal --> '...'
bool IsDoubleQuote = false; // inside a string literal --> "..."
bool IsComment = false; // check comments --> `...`
bool IterWhile = true;
while (IterWhile) {
bool IsInStc = (IsDoubleQuote || IsSimpleQuote || IsComment);
std::ifstream file_tmp(this->CurrentFilename);
if (this->eof) break;
char chr = this->File.get();
char next = file_tmp.seekg(this->CurrentCharIndex + 1).get();
++this->CurrentCharInCurrentLineIndex;
++this->CurrentCharIndex;
{
if (!IsInStc && !IsComment && chr == '`') IsComment = true; else if (!IsInStc && IsComment && chr == '`') { IsComment = false; continue; }
if (IsComment) continue;
if (!IsInStc && chr == '"') IsDoubleQuote = true;
else if (!IsInStc && chr == '\'') IsSimpleQuote = true;
else if (IsDoubleQuote && chr == '"') IsDoubleQuote = false;
else if (IsSimpleQuote && chr == '\'') IsSimpleQuote = false;
}
if (chr == '\n') {
++this->CurrentLineIndex;
this->CurrentCharInCurrentLineIndex = -1;
}
token_tmp += chr;
if (!IsInStc && IsLangDelim(chr)) IterWhile = false;
}
if (token_tmp.size() > 1 && System::Text::EndsWith(token_tmp, ";") || System::Text::EndsWith(token_tmp, " ")) token_tmp.pop_back();
++this->NbrOfTokens;
location_t pos;
pos.char_pos = this->CurrentCharInCurrentLineIndex;
pos.filename = this->CurrentFilename;
pos.line = this->CurrentLineIndex;
SetToken_t(&ret, token_tmp, TokenList::ToToken(token_tmp), pos);
return ret;
}
Here is the function IsLangDelim :
bool IsLangDelim(char chr) {
return (chr == ' ' || chr == '\t' || TokenList::IsSymbol(CharToString(chr)));
}
TokenList is a namespace that contains the list of tokens, as well as some functions (like IsSymbol in this case).
I have already tried other versions of this method, but the result is almost always the same.
Do you have any idea how to improve this method?
One solution to your problem is std::regex. The syntax is a little difficult to understand at first, but once you are used to it, it is very convenient.
It is also well suited to finding tokens.
The specific criteria can be expressed in the regex string.
For your case I would use: std::regex re(R"#((\w+|\d+|[;:\(\)\{\}\+\-\*\/\%\=]))#");
This means:
Look for one or more word characters (that is a word)
Look for one or more digits (that is an integer number; note that \w also matches digits, so this alternative is redundant here)
Or look for any of the meaningful operators (like '+', '-', '{' and so on)
You can extend the regex for everything else you are searching for. You can also apply a regex to a regex result.
Please see the example below. It produces the output you showed from the input you provided.
With this, the task you described is a single statement in main.
#include <iostream>
#include <string>
#include <algorithm>
#include <iterator> // std::ostream_iterator
#include <regex>
// Our test data (raw string) .
std::string testData(
R"#(func main() {
var MyVar : integer = 3+2;
}
)#");
std::regex re(R"#((\w+|\d+|[;:\(\)\{\}\+\-\*\/\%\=]))#");
int main(void)
{
std::copy(
std::sregex_token_iterator(testData.begin(), testData.end(), re, 1),
std::sregex_token_iterator(),
std::ostream_iterator<std::string>(std::cout, "\n")
);
return 0;
}
You are trying to parse with a single loop, which makes the code very complicated. Instead I suggest something like this:
struct token { ... };
struct lexer {
vector<token> tokens;
string source;
unsigned int pos;
bool parse_ident() {
if (!is_alpha(source[pos])) return false;
auto start = pos;
while(pos < source.size() && is_alnum(source[pos])) ++pos;
tokens.push_back({ token_type::ident, source.substr(start, pos - start) });
return true;
}
bool parse_num() { ... }
bool parse_comment() { ... }
...
bool parse_whitespace() { ... }
void parse() {
while(pos < source.size()) {
if (!parse_comment() && !parse_ident() && !parse_num() && ... && !parse_whitespace()) {
throw error{ "unexpected character at position " + std::to_string(pos) };
}
}
}
};
This is the standard structure I use when lexing files in any scripting language I've written. Lexing is usually greedy, so you don't need to bother with regex (which is effective, but slower, unless you use some template-based implementation). Just define your parse_* functions, make sure they return false if they didn't parse a token, and make sure they are called in the correct order.
The order itself usually doesn't matter, but:
operators need to be checked from longest to shortest
a number in the style .123 might be incorrectly recognized as the . operator (so you need to make sure that no digit follows the .)
numbers and identifiers look very much alike, except that identifiers start with a non-digit.
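The ordering rules above can be sketched in a compact hand-written lexer. This is only an illustration under stated assumptions: the function name lex and the operator table kOperators are hypothetical, tokens are returned as plain strings rather than typed token objects, and comments, strings, and numbers like .123 are not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Assumed operator table, ordered longest first so that "==" wins over "=".
static const char* const kOperators[] = {
    "==", "!=", "<=", ">=",
    "(", ")", "{", "}", ":", ";", "=", "+", "-", "*", "/", "<", ">"};

std::vector<std::string> lex(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    while (pos < src.size()) {
        const unsigned char ch = static_cast<unsigned char>(src[pos]);
        if (std::isspace(ch)) { ++pos; continue; }  // whitespace only separates

        const std::size_t start = pos;
        if (std::isalpha(ch) || ch == '_') {
            // Identifier: starts with a non-digit, continues with alphanumerics.
            while (pos < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[pos])) || src[pos] == '_'))
                ++pos;
        } else if (std::isdigit(ch)) {
            // Integer literal: a run of digits.
            while (pos < src.size() && std::isdigit(static_cast<unsigned char>(src[pos])))
                ++pos;
        } else {
            // Operator: the first (and therefore longest) table entry that matches.
            for (const char* op : kOperators) {
                const std::string s(op);
                if (src.compare(pos, s.size(), s) == 0) { pos += s.size(); break; }
            }
            if (pos == start)
                throw std::runtime_error("unexpected character at position " +
                                         std::to_string(pos));
        }
        tokens.push_back(src.substr(start, pos - start));
    }
    return tokens;
}
```

Because every branch consumes exactly one token and stops at the first character that does not belong to it, inputs like main( and 3+ are split without any lookahead into a second file handle.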

Recursion cin.getline() doesn't want to ask for input

Here is the code:
void Reader::read(short& in) {
char* str = new char[6];
char* strbeg = str;
cin.getline(str, 6);
in = 0;
int value = 0;
short sign = 1;
if (*str == '+' || *str == '-') {
if (*str == '-' ) sign = -1;
str++;
}
while (isdigit(*str)) {
value *= 10;
value += (int) (*str - '0');
str++;
if (value > 32767) {
cout.write("Error, value can't fit short. Try again.\n", 41);
delete[] strbeg;
read(in);
return;
}
}
if (sign == -1) { value *= -1; }
in = (short) value;
delete[] strbeg;
return;
}
What happens is that if I type 999999999, the function calls itself, but the getline call on its fourth line does not ask for input again. The debugger couldn't tell me much, as this is more of a language-specific question. Thank you in advance. Have a nice day!
Yes, the goal is to parse the input as a short. I know about losing 1 from the minimum negative value; it's a work in progress :)
=== edit ===
I've tried goto... No, same thing. So it's not about visible variables or addresses, I guess.
=== edit ===
I can't use operator >> as it is forbidden by the task.
999999999 does not fit in the 6-character buffer, so getline sets failbit on cin. Your program then reaches read(in) and calls cin.getline() again. Because failbit is still set, cin will not ask for any input.
If you try to figure out why cin does ask for more input in my code, you may work all this out yourself.
Here is an example.
#include <iostream>
#include <climits>
using namespace std;
int main() {
char str[6];
short x = 0;
bool flag = false;
while (flag == false) {
cin.getline(str, 6);
flag = cin.good();
if (flag) { // if read successfully
char *p = str;
if (*p=='-') // special case for the first character
++p;
while (*p && *p>='0' && *p<='9')
++p;
if (*p) // there is a non digit non '\0' character
flag = false;
}
if (flag == false) {
cout << "An error occurred, try again." << endl;
bool atEof = cin.eof();
cin.clear(); // reset failbit first; unget() and ignore() do nothing while it is set
if (!atEof) {
cin.unget(); // put back the last character read (possibly the '\n')
cin.ignore(INT_MAX, '\n'); // discard the rest of the line
}
} else {
// str is now ready for parsing
// TODO: do your parsing work here
// for example x = atoi(str);
}
}
std::cout << x << std::endl;
return 0;
}
As we have discussed, you don't need new.
Check whether the string you read is clean before parsing it. If you mix checking and parsing, things get complicated.
And you don't need recursion.
Reading characters from the stream with istream::getline seems to be the only option we have here. When an error occurs, this function doesn't tell us much, so we have to deal with overflow and other problems separately.

How to perform other operations when my tokenizer recognizes a token?

I have written a simple tokenizer that splits a command line into separate lines, each containing a single word. I am trying to ...
Make the program close if the first word of a command line is "quit"
Recognize instructions such as "Pickup", "Save", and "Go", after which the program will look at the next token.
My idea has been to use a simple switch with cases to check for these commands, but I cannot figure out where to place it.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
char command[256];
int commandIndex;
char token[32];
int isWhiteSpace(char character) {
if (character == ' ') {
return 1;
}
else if(character == '\t') {
return 1;
}
else if(character < ' ') {
return 1;
}
else {
return 0;
}
}
char* getToken() {
int index = 0;
// Skip white spaces
while(commandIndex<256 && isWhiteSpace(command[commandIndex])) {
commandIndex ++;
}
// If at end of line return empty token
if(commandIndex>=256) {
token[0] = 0;
return token;
}
// Capture token (index<31 leaves room for the terminator in token[32])
while(commandIndex<256 && index<31 && !isWhiteSpace(command[commandIndex])) {
token[index] = command[commandIndex];
index++;
commandIndex ++;
}
token[index] = 0;
return token;
}
void main() {
printf("Zeta - Version 2.0\n");
while(1) {
printf("Command: ");
gets_s(command);
commandIndex = 0;
char* token = getToken();
while (strcmp(token,"") != 0) {
printf("%s\n", token);
token = getToken();
}
}
}
A little reorganization of the loop you have in main will do it.
int main() {
printf("Zeta - Version 2.0\n");
bool done = false;
while (!done) {
printf("Command: ");
gets_s(command);
commandIndex = 0;
char* token = getToken();
if (strcmp(token, "quit") == 0) {
done = true;
} else if (strcmp(token, "pickup") == 0) {
doPickup();
} else if (strcmp(token, "save") == 0) {
char * filename = getToken();
doSave(filename);
} ...
}
return 0;
}
You can't use a switch statement with strings, so you just use a bunch of if ... else if ... statements to check for each command. There are other approaches, but this one required the fewest changes from the code you already have.
In the example, under the handling for "save" I showed how you can just call getToken again to get the next token on the same command line.
(Note that I also fixed the return value for main. Some compilers will let you use void, but that's not standard so it's best if you don't do that.)
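One of the other approaches is a lookup table from command word to handler, which keeps the loop short as the number of commands grows. The sketch below is in C++ rather than C and is only an illustration: dispatch, Handler, and the handler bodies are hypothetical stand-ins for doPickup, doSave, and so on, and output is appended to a string so the behavior is easy to check.

```cpp
#include <functional>
#include <sstream>
#include <string>
#include <unordered_map>

// A handler receives the next token (possibly empty) and appends to `out`.
// It returns false when the command loop should stop.
using Handler = std::function<bool(const std::string& arg, std::string& out)>;

bool dispatch(const std::string& line, std::string& out) {
    // Assumed command set; each entry stands in for a real doXxx() function.
    static const std::unordered_map<std::string, Handler> handlers = {
        {"quit",   [](const std::string&, std::string& o) { o += "bye\n"; return false; }},
        {"pickup", [](const std::string& a, std::string& o) { o += "pickup " + a + "\n"; return true; }},
        {"save",   [](const std::string& a, std::string& o) { o += "save " + a + "\n"; return true; }},
    };

    std::istringstream words(line);
    std::string verb, arg;
    words >> verb >> arg;  // first token is the command, second its argument

    const auto it = handlers.find(verb);
    if (it == handlers.end()) {
        out += "unknown command: " + verb + "\n";
        return true;  // keep the loop running
    }
    return it->second(arg, out);
}
```

The main loop then reduces to reading a line and looping while dispatch returns true.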

"\b" is added to the string as a character

In the code below, "\b" visually removes a character from the output, but the string's size still grows, as if the character were inside it but not visible.
while (true) {
c = _getch();
if (c=='\r') {break;}
else if (c=='\b') { cout<<"\b"<<" "<<"\b"; s+="\b \b"; }
else {cout<<"*"; s=s+c;}
}
For instance, the size of the string (abc"\b"d), where c is removed and replaced by d, is still 5.
I would like to know how to handle the backspace efficiently in this circumstance.
If you are reading character by character into a string, you could do something like this:
std::string mystr;
while (true) {
c = _getch();
if (c=='\r') {break;}
if(c == '\b')
{
// This will remove last character from your string
if(mystr.size () > 0)
{
cout<<"\b"<<" "<<"\b";
mystr.resize (mystr.size () - 1);
// or mystr.pop_back() in C++11
}
}
else
{
cout<<"*";
mystr += c;
}
}
You need to "physically" remove the last character from the string when you get a backspace:
while (true) {
c = _getch();
if (c=='\r') {
break;
}
if (c=='\b') {
if (s.length() > 0) {
cout<<"\b"<<" "<<"\b"; // only erase on screen when there is something to erase
s = s.substr(0, s.length()-1);
}
}
else {cout<<"*"; s=s+c;}
}
As an optimization, we can trim s instead of reassigning, as suggested by Jason:
s.resize(s.size() -1);
(While we're at it, we could save s.length() (or s.size()) into a local variable to avoid the extra call - assuming the compiler, knowing about std::string, doesn't do it already).
for(char c=_getch(); c!='\r'; c=_getch())
if(c=='\b') {
if(!mystr.empty()) // pop_back() on an empty string is undefined behavior
mystr.pop_back();
}
else
mystr.push_back(c);
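The same backspace logic can be separated from the console I/O, which makes it easy to check without _getch(). This is a sketch under the assumption that the keystrokes are available as a string; applyKeys is a hypothetical name.

```cpp
#include <string>

// Applies a recorded sequence of key codes to an initially empty buffer:
// '\r' ends the input, '\b' removes the last character (if there is one),
// and every other character is appended.
std::string applyKeys(const std::string& keys) {
    std::string buffer;
    for (char c : keys) {
        if (c == '\r') break;
        if (c == '\b') {
            if (!buffer.empty()) buffer.pop_back();  // never pop an empty string
        } else {
            buffer.push_back(c);
        }
    }
    return buffer;
}
```

For the sequence a, b, c, backspace, d, Enter, the resulting buffer holds abd and has size 3, since the backspace is never stored in the string.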