Storing the current line being analysed by flex - c++

In my lexer generated by flex, I would like to be able to store each line of the file, so that when reporting errors, I can show the user the line the error occurred on.
I could of course do this by reading all lines from the file into a vector before or after lexing, but this would just add to the time needed to parse a file.
What I thought I could do instead is store the current line whenever a newline character is matched and insert it into a vector. So my question is: is there a variable/macro in flex that stores the current line? (Something like yyline, perhaps.)
Note: I am also using bison

By itself, lex/flex does not do what you ask. As noted, you want this for reporting error messages. (I do something like this in vi-like-emacs.)
With lex/flex, the only way to store the entire line is to record each token from the current line into your own line-buffer. That can be complicated, especially if your lexer has to handle multi-line content (such as comments or strings).
The yytext variable only shows you the most recently matched token (and yyleng, the corresponding length). If your lexer does a simple ECHO, that is a token just like the ones you pay attention to.
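A minimal sketch of that bookkeeping in the .l file (untested; line_buf and lines are names of my own, not flex built-ins):
%{
#include <string>
#include <vector>

static std::string line_buf;            /* text of the line being lexed */
static std::vector<std::string> lines;  /* completed lines, for error reports */

/* Runs before every rule's action, so every token (ECHOed or not)
   lands in the buffer. */
#define YY_USER_ACTION line_buf.append(yytext, yyleng);
%}
%%
\n    { line_buf.pop_back();            /* drop the '\n' appended above */
        lines.push_back(line_buf);
        line_buf.clear(); }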
Reading the file in advance, as noted, is one way to simplify the problem. In vi-like-emacs, the lexers read from the in-memory buffer via a function rather than from an input stream. This bypasses the normal stream-handling logic by redefining the YY_INPUT macro, e.g.,
#define YY_INPUT(buf,result,max_size) result = flt_input(buf,max_size)
Likewise, ECHO is redefined (since the editor reads the results back rather than letting them go to the standard output):
#define ECHO flt_echo(yytext, yyleng)
and it traps errors detected by the lexer with another redefinition:
#define YY_FATAL_ERROR(msg) flt_failed(msg);
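For illustration, a function with the shape YY_INPUT expects could look like this (flt_input and the buffer variables here are hypothetical stand-ins, not the actual vi-like-emacs code):
#include <string.h>

static const char *doc_text;  /* whole document, loaded up front */
static size_t doc_len;        /* its length */
static size_t doc_pos;        /* current read position */

static int flt_input(char *buf, int max_size)
{
    size_t n = doc_len - doc_pos;
    if (n > (size_t) max_size)
        n = (size_t) max_size;
    memcpy(buf, doc_text + doc_pos, n);
    doc_pos += n;
    return (int) n;           /* 0 tells flex it hit end of input */
}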
However you do this, keep in mind that the yylineno value flex reports for a given token is its value at the end of that token, since the counter is updated as the token's text is consumed.
While it is nice to report the entire line in context in an error message, it is also useful to track the line and column number of each token -- various editors can deal with lines like this
filename:line:col:message
If you build up your line-buffer by tracking tokens, it might be relatively simple to track the column on which each token begins as well.
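A sketch of that using flex's YY_USER_ACTION hook (yycolumn and token_column are my own variables, not flex built-ins; if you already use YY_USER_ACTION to build a line buffer, fold both updates into one definition):
static int yycolumn = 1;     /* column where the next token starts */
static int token_column;     /* column where the current token began */

#define YY_USER_ACTION  { token_column = yycolumn; yycolumn += yyleng; }

/* ...and in the rule matching \n, reset: yycolumn = 1; */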

Related

ReportEvent: Logged messages run all lines together

I'm noticing that when logging a multi-line message using ReportEvent, it drops all line ends and runs the text together. For example, my MC file may have:
MessageId=
Severity=Informational
SymbolicName=MSG_TEST_MSG
Language=English
Some text
Another line of text.
Last line of text.
.
The message in Event Viewer shows all three lines run together.
If I put \r\n sequences in the insertion-string text, those line ends do show up correctly in the logged message.
Also, if I use FormatMessageW to generate the text string of the above message, the line ends are correctly included in the text. They seem to be removed only when posting to Event Viewer.
I have not seen ANY reference to the fact that line ends are being dropped anywhere. Any idea? Is this just the way it is?
Thanks.
You have to use %n to force a "hard line break" inside the message.
Source:
%n
Generates a hard line break when it occurs at the end of a line. This
can be used with FormatMessage to ensure that the message fits a
certain width.
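Applied to the message text from the question, that means ending each interior line with %n:
Some text%n
Another line of text.%n
Last line of text.
.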
"Why does Format­Message say that %0 terminates the message without a trailing newline? Is it secretly adding newlines?" might also be interesting in this context.

Including files as raw string literals [duplicate]

This question already has answers here:
"#include" a text file in a C program as a char[]
(21 answers)
Closed 9 years ago.
I have a C++ source file and a Python source file. I'd like the C++ source file to be able to use the contents of the Python source file as a big string literal. I could do something like this:
char* python_code = "
#include "script.py"
"
But that won't work because there need to be \'s at the end of each line. I could manually copy and paste in the contents of the Python code and surround each line with quotes and a terminating \n, but that's ugly. Even though the python source is going to effectively be compiled into my C++ app, I'd like to keep it in a separate file because it's more organized and works better with editors (emacs isn't smart enough to recognize that a C string literal is python code and switch to python mode while you're inside it).
Please don't suggest I use PyRun_File, that's what I'm trying to avoid in the first place ;)
The C/C++ preprocessor acts in units of tokens, and a string literal is a single token. As such, you can't intervene in the middle of a string literal like that.
You could preprocess script.py into something like:
"some code\n"
"some more code that will be appended\n"
and #include that, however. Or you can use xxd -i to generate a C static array ready for inclusion.
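Consuming the pre-quoted version is then plain string-literal concatenation (the file name script_py.inc is an assumption here):
/* script_py.inc is assumed to hold the pre-quoted lines shown above */
const char *python_code =
#include "script_py.inc"
;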
This won't get you all the way there, but it will get you pretty damn close.
Assuming script.py contains this:
print "The current CPU time in seconds is: ", time.clock()
First, wrap it up like this:
STRINGIFY(print "The current CPU time in seconds is: ", time.clock())
Then, just before you include it, do this (the macro must be variadic, since the Python code can contain top-level commas that would otherwise be parsed as extra macro arguments):
#define STRINGIFY(...) #__VA_ARGS__
const char * script_py =
#include "script.py"
;
There's probably an even tighter answer than that, but I'm still searching.
The best way to do something like this is to include the file as a resource if your environment/toolset has that capability.
If not (like embedded systems, etc.), you can use a bin2c utility (something like http://stud3.tuwien.ac.at/~e0025274/bin2c/bin2c.c). It'll take a file's binary representation and spit out a C source file that includes an array of bytes initialized to that data. You might need to do some tweaking of the tool or the output file if you want the array to be '\0' terminated.
Incorporate running the bin2c utility into your makefile (or as a pre-build step of whatever you're using to drive your builds). Then just have the file compiled and linked with your application and you have your string (or whatever other image of the file) sitting in a chunk of memory represented by the array.
If you're including a text file as a string, one thing you should be aware of is that the line endings might not match what your functions expect; this might be another thing to handle in the bin2c utility, or you'll want to make sure your code handles whatever line endings are in the file properly. One option is to give the bin2c utility a '-s' switch indicating that the file should be incorporated as a text string, so line endings are normalized and a zero byte is placed at the end of the array.
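A minimal bin2c-style generator with that '-s' idea baked in might look like the following sketch (this is not the tool linked above, just an illustration):
#include <cstdio>

int main(int argc, char **argv)
{
    if (argc < 3) {
        std::fprintf(stderr, "usage: bin2c <input-file> <array-name>\n");
        return 1;
    }
    std::FILE *in = std::fopen(argv[1], "rb");
    if (!in)
        return 1;
    std::printf("static const unsigned char %s[] = {\n", argv[2]);
    int c, n = 0;
    while ((c = std::fgetc(in)) != EOF)
        std::printf("0x%02x,%s", c, (++n % 12) ? " " : "\n");
    std::printf("0x00\n};\n");  /* trailing zero byte: usable as a C string */
    std::fclose(in);
    return 0;
}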
You're going to have to do some of your own processing on the Python code, to deal with any double-quotes, backslashes, trigraphs, and possibly other things, that appear in it. You can at the same time turn newlines into \n (or backslash-escape them) and add the double-quotes on either end. The result will be a header file generated from the Python source file, which you can then #include. Use your build process to automate this, so that you can still edit the Python source as Python.
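The per-line transformation is small enough to sketch; here to_literal_line is a name of my own:
#include <string>

/* Turn one line of Python source into one C string-literal line,
   e.g.   print "hi"   becomes   "print \"hi\"\n"             */
std::string to_literal_line(const std::string &line)
{
    std::string out = "\"";
    for (char ch : line) {
        switch (ch) {
        case '\\': out += "\\\\"; break;
        case '"':  out += "\\\""; break;
        case '?':  out += "\\?";  break;  /* defuses trigraph sequences */
        default:   out += ch;     break;
        }
    }
    out += "\\n\"\n";  /* append \n escape, closing quote, real newline */
    return out;
}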
You could use Cog as part of your build process (to do the preprocessing and to embed the code). I admit that the result of this is probably not ideal, since then you end up seeing the code in both places. But any time I see "Python", "C++", and "preprocessor" in close proximity, I feel it deserves a mention.
Here is how to automate the conversion with cmd.exe:
------ html2h.bat ------
@echo off
echo const char * html_page = "\
sed "/.*/ s/$/ \\n\\/" ../src/page.html | sed s/\"/\\\x22/g
echo.
echo ";
It was called like this:
cmd /c "..\Debug\html2h.bat" > "..\debug\src\html.h"
and attached to the code by
#include "../Debug/src/html.h"
printf("%s\n", html_page);
This is quite a system-dependent approach, but, like most people, I disliked the hex dump.
Use fopen, getline, and fclose.
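In other words, read the script at run time rather than embedding it. A sketch using the POSIX getline() (assuming the .py file ships next to the binary):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

/* Read a whole text file into one heap-allocated, NUL-terminated
   buffer. Caller frees the result; returns NULL on error. */
char *read_script(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return NULL;
    char *line = NULL, *all = NULL;
    size_t cap = 0, total = 0;
    ssize_t n;
    while ((n = getline(&line, &cap, f)) != -1) {
        char *grown = (char *) realloc(all, total + n + 1);
        if (!grown)
            break;
        all = grown;
        memcpy(all + total, line, n);
        total += n;
        all[total] = '\0';
    }
    free(line);
    fclose(f);
    return all;
}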

getline() text with UNIX formatting characters

I am writing a C++ program which reads lines of text from a .txt file. Unfortunately the text file is generated by a twenty-something year old UNIX program and it contains a lot of bizarre formatting characters.
The first few lines of the file are plain, English text and these are read with no problems. However, whenever a line contains one or more of these strange characters mixed in with the text, that entire line is read as characters and the data is lost.
The really confusing part is that if I manually delete the first couple of lines, so that the very first character in the file is one of these unusual characters, then everything in the file is read perfectly. The unusual characters just display as little ASCII squiggles (arrows, smiley faces, etc.), which is fine. It seems as though a decision is being made automatically, without my knowledge or consent, based on the first line read.
Based on some googling, I suspected that the issue might be with the locale, but according to the visual studio debugger, the locale property of the ifstream object is "C" in both scenarios.
The code which reads the data is as follows:
//Function to open file at location specified by inFilePath, load and process data
int OpenFile(const char* inFilePath)
{
    string line;
    ifstream codeFile;

    //open text file
    codeFile.open(inFilePath, ios::in);

    //read file line by line; looping on getline itself is more robust
    //than looping on codeFile.good()
    while (getline(codeFile, line))
    {
        //check non-zero length
        if (line != "")
            ProcessLine(&line[0]);
    }

    //close file
    codeFile.close();
    return 1;
}
If anyone has any suggestions as to what might be going on or how to fix it, they would be very welcome.
From reading about your issues it sounds like you are reading in binary data, which will cause getline() to throw out content or simply skip over the line.
You have a couple of choices:
1. If you simply need lines from the data file, you can first sanitise them by removing all non-printable characters (that is the "official" name for those weird ASCII characters). On UNIX a tool such as strings would help you with that process. You can of course also do this programmatically in your code by simply reading in X amount of data, storing it in a string, and then removing those characters that fall outside the standard ASCII range. This will most likely cause you to lose any Unicode that may be stored in the file.
2. You can change your program to understand the format: basically, write a parser that allows you to process the document in a more sane way.
If you can, I would suggest trying solution number 1, simply to see if the results are sane and can still be used. You mention that this is medical data; do you perchance know what file format it is? If you are trying to find out and have access to a unix/linux machine, you can use the file utility and maybe it can give you a clue (worst case it will tell you it is simply data).
If possible, try getting a "clean" file whose hex dump you can post, so that we can try to provide better help than what we are currently providing. By clean I mean that there is no personally identifying information in the file.
For number 2, open the file in binary mode. You mentioned using Windows, where binary and non-binary files are handled differently by std::fstream objects; on UNIX systems this is not the case (on most systems, anyway; I'm sure I'll get a comment regarding the one system that doesn't match this description).
codeFile.open(inFilePath,ios::in);
would become
codeFile.open(inFilePath, ios::in | ios::binary);
Instead of getline() you will want to become intimately familiar with .read() which will allow unformatted operations on the ifstream.
Reading will be like this:
// This code has not been tested!
char input[1024];
codeFile.read(input, 1024);
int actual_read = codeFile.gcount();
// Here you can process input, up to a maximum of actual_read characters.
//ProcessLine() // We didn't necessarily read a line!
ProcessData(input, actual_read);
The other thing, as mentioned, is that you can change the locale for the current stream and change which character it considers a newline; maybe this will fix your issue without requiring the unformatted operations:
imbue the stream with a new locale that only knows about the newline. This method may or may not let your getline() work without issues.

Using boost spirit parser for in-line editing and autocompletion prompting

I am trying to design a server app which will read a command line over a socket stream (one character at a time). Obviously the simple way is to read characters up to the EOL and execute the command contained in the receive buffer.
Instead, I want to have it so that when a user starts entering a command line and then enters "?", the app will generate a list of all the parameters which are syntactically correct from that point in the parsing of the command line (this is similar to the way it is in some embedded devices that I have seen, like Cisco and Netscreen routers).
For example,
$ set interface ?
would display
> set interface [option] -- displays information about the specified interface.
>
> [option] must be one of the following:
> address [ip-addr]
> port [port-no]
> protocol [tcp|udp]
So basically, I would need to know where we were in the grammar, and what symbols are expected from that point forward.
It would also be nice if it could support simple line editing commands (BS, DEL, insert, left-arrow, right-arrow), and maybe even up-arrow/down-arrow for command history.
Can this be done using the boost spirit parser?
EDIT:
Simply put: Is there a simple way to create a boost spirit parser which (in addition to having a set of rules), immediately executes an action anytime '?' is encountered on the input stream (without having to explicitly encode the token '?' into the rules)?

Clearing parser state of a bison generated parser

I am using a C lexer that is Flex-generated, and a C++ parser that is Bison-generated. I have modified the parser to accept only string input.
I am calling the parser function yyparse() in a loop, and reading line by line of user input. I stop the loop if the input is "quit".
The problem I am facing is that when the input doesn't match any rule, the parser stops abruptly, and at the next iteration starts off in the same state, expecting the rule that was interrupted (due to the syntax error) to complete.
It works fine if the input is valid and matches a parser rule.
On syntax error I have redefined the yyerror() function, that displays a simple error message.
How do I clear the state of the parser when the input doesn't match any parser rule, so that at next iteration the parser starts afresh?
According to my Lex & Yacc book there is a function yyrestart(file).
Else (and I quote a paragraph of the book):
This means that you cannot restart a lexer just by calling yylex(). You have to reset it into the default state using BEGIN INITIAL, discard any input text buffered up by unput(), and otherwise arrange so that the next call to input() will start reading the new input.
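Putting the book's advice into code, a reset helper could look like this (it must live in the third section of the .l file, since BEGIN is a macro that only exists inside the generated scanner):
void reset_lexer(FILE *input)
{
    BEGIN(INITIAL);     /* leave any string/comment start condition */
    yyrestart(input);   /* discard buffered text, rebind the input */
}
Call it after a syntax error, before the next yyparse() in your loop.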
Interesting question - I have a parser that can be compiled with Bison, Byacc, MKS Yacc or Unix Yacc, and I don't do anything special to deal with resetting the grammar whether it fails or succeeds. I don't use a Flex or Lex tokenizer; that is hand-coded, but it works strictly off strings. So, I have to agree with Gamecat; the most likely cause of the trouble is the lexical analyzer, rather than the parser proper.
(If you want to obtain my code, you can download SQLCMD from the IIUG (International Informix User Group) web site. Although the full product requires Informix ESQL/C, the grammar can, in principle, be converted into a standalone test program. Sadly, however, it appears I've not run that test for a while - there are some issues with the test compilation. Some structure element names changed in April 2006, plus there are linkage issues. I will need to re-reorganize the code so that the grammar can be tested standalone again.)