How to strip C++ style single line comments (`// ...`) - regex

For a small DSL I'm writing, I'm looking for a regex to match a comment at the end of a line, like the // syntax of C++.
The simple case:
someVariable = 12345; // assignment
Is trivial to match but the problem starts when I have a string in the same line:
someFunctionCall("Hello // world"); // call with a string
The // inside the string should not be matched as a comment.
EDIT - The thing that compiles the DSL is not mine. It's a black box as far as I'm concerned, which I don't want to change, and it doesn't support comments. I just want to add a thin wrapper to make it support comments.

EDIT
Since you are effectively preprocessing a source file, why not use an existing preprocessor? If the language is sufficiently similar to C/C++ (especially regarding quoting and string literals), you will be able to just use cpp -P:
echo 'int main() { char* sz="Hello//world"; /*profit*/ } // comment' | cpp -P
Output: int main() { char* sz="Hello//world"; }
Other ideas:
Use a proper lexer/parser instead
Have a look at
CoCo/R (available for Java, C++, C#, etc.)
ANTLR (idem)
Boost Spirit (with Spirit Lex to make it even easier to strip the comments)
All suites come with sample grammars that parse C, C++ or a subset thereof

shoosh wrote:
EDIT - The thing that compiles the DSL is not mine. It's a black box as far as I'm concerned, which I don't want to change, and it doesn't support comments. I just want to add a thin wrapper to make it support comments.
In that case, create a very simple lexer that matches one of three tokens:
(1) // ... comments
(2) string literals: " ... "
(3) or, if none of the above matches, any single character
Now, while you iterate over these three types of tokens, simply print tokens (2) and (3) to stdout (or to a file) to get the uncommented version of your source file.
A demo with GNU Flex:
example input file, in.txt:
someVariable = 12345; // assignment
// only a comment
someFunctionCall("Hello // world"); // call with a string
someOtherFunctionCall("Hello // \" world"); // call with a string and
// an escaped quote
The lexer grammar file, demo.l:
%%
"//"[^\r\n]* { /* skip comments */ }
"\""([^"\\]|[\\].)*"\"" {printf("%s", yytext);}
. {printf("%s", yytext);}
%%
int main(int argc, char **argv)
{
while(yylex() != 0);
return 0;
}
And to run the demo, do:
flex demo.l
cc lex.yy.c -lfl
./a.out < in.txt
which will print the following to the console:
someVariable = 12345;
someFunctionCall("Hello // world");
someOtherFunctionCall("Hello // \" world");
EDIT
I'm not really familiar with C/C++, and just saw @sehe's recommendation of using a preprocessor. That looks to be a far better option than creating your own (small) lexer. But I think I'll leave this answer, since it shows how to handle this kind of thing if no preprocessor is available (for whatever reason: perhaps cpp doesn't recognise certain parts of the DSL?).
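If neither cpp nor Flex is at hand, the same three-token idea can be written by hand. The following is only a minimal sketch in plain C, assuming the DSL uses C-style double-quoted strings with backslash escapes; strip_line_comments is a made-up name, not part of any library:

```c
#include <stddef.h>

/*
 * Copy src to dst, dropping // comments but leaving string literals intact.
 * dst must have room for strlen(src) + 1 bytes. Returns the output length.
 * Assumption: strings are "..." with backslash escapes, as in C.
 */
size_t strip_line_comments(const char *src, char *dst)
{
    size_t out = 0;
    int in_string = 0;                  /* currently inside a "..." literal? */

    for (size_t i = 0; src[i] != '\0'; i++) {
        char c = src[i];

        if (in_string) {
            dst[out++] = c;
            if (c == '\\' && src[i + 1] != '\0') {
                dst[out++] = src[++i];  /* copy the escaped character as-is */
            } else if (c == '"') {
                in_string = 0;          /* closing quote */
            }
        } else if (c == '"') {
            in_string = 1;              /* opening quote */
            dst[out++] = c;
        } else if (c == '/' && src[i + 1] == '/') {
            /* skip the comment, stopping just before the line ending */
            while (src[i + 1] != '\0' && src[i + 1] != '\n' && src[i + 1] != '\r')
                i++;
        } else {
            dst[out++] = c;             /* any other single character */
        }
    }
    dst[out] = '\0';
    return out;
}
```

Like the Flex version, this keeps the whitespace in front of a removed comment and the line ending after it, so line numbers in the output still match the input.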

Related

Bison : Line number included in the error messages

OK, so I suppose my question is quite self-explanatory.
I'm currently building a parser in Bison, and I want to make error reporting somewhat better.
Currently, I've set %define parse.error verbose, which gives messages like: syntax error, unexpected ***********************, expecting ********************.
All I want is to add some more information in the error messages, e.g. line number (in input/file/etc)
My current yyerror (well nothing... unusual... lol) :
void yyerror(const char *str)
{
fprintf(stderr,"\x1B[35mInterpreter : \x1B[37m%s\n",str);
}
P.S.
I've gone through the latest Bison documentation, but I seem quite lost...
I've also had a look into the %locations directive, which most likely is very close to what I need - however, I still found no complete working example and I'm not sure how this is to be used.
So, here I am with a step-by-step solution:
We add the %locations directive in our grammar file (between %} and the first %%)
We make sure that our lexer file contains an include for our parser (e.g. #include "mygrammar.tab.h"), at the top
We add the %option yylineno option in our lexer file (between %} and the first %%)
And now, in our yyerror function (which will supposedly be in our lexer file), we may freely use this... yylineno (= current line in file being processed) :
void yyerror(const char *str)
{
fprintf(stderr,"Error | Line: %d\n%s\n",yylineno,str);
}
Yep. Simple as that! :-)
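If columns are wanted as well, %locations gives the parser a location structure with first_line, first_column, last_line and last_column fields. The sketch below only illustrates formatting such a location; struct src_loc is a stand-in for the generated location type, format_error is a made-up helper, and it assumes the scanner fills in the columns itself and treats last_column as inclusive:

```c
#include <stdio.h>

/* Stand-in for the parser's location type (same four fields). */
struct src_loc {
    int first_line, first_column;
    int last_line, last_column;     /* assumption: inclusive end column */
};

/*
 * Write "LINE.COL-COL: msg" into buf, or "LINE.COL-LINE.COL: msg" when the
 * location spans several lines. Returns what snprintf returns.
 */
int format_error(char *buf, size_t size, const struct src_loc *loc, const char *msg)
{
    if (loc->first_line == loc->last_line)
        return snprintf(buf, size, "%d.%d-%d: %s",
                        loc->first_line, loc->first_column,
                        loc->last_column, msg);
    return snprintf(buf, size, "%d.%d-%d.%d: %s",
                    loc->first_line, loc->first_column,
                    loc->last_line, loc->last_column, msg);
}
```

The resulting string can then be passed to fprintf(stderr, ...) from yyerror in place of the plain line number.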
What worked for me was adding extern int yylineno in the .ypp file:
/* parser.ypp */
%{
extern int yylineno;
%}
/* scanner.lex */
...
%option yylineno
Bison ships with a number of examples to demonstrate its features; see /usr/local/share/doc/bison/examples on your machine (where the prefix /usr/local depends on your configuration).
These examples in particular might be of interest to you:
lexcalc uses precedence directives and location tracking. It uses Flex to generate the scanner.
bistromathic demonstrates best practices when using Bison.
Its hand-written scanner tracks locations.
Its interface is pure.
It uses %params to pass user information to the parser and scanner.
Its scanner uses the error token to signal lexical errors and enter error recovery.
Its interface is "incremental", well suited for interaction: it uses the push-parser API to feed the parser with the incoming tokens.
It features an interactive command line with completion based on the parser state, based on yyexpected_tokens.
It uses Bison's standard catalog for internationalization of generated messages.
It uses a custom syntax error with location, lookahead correction and token internationalization.
Error messages quote the source with squiggles that underline the error:
> 123 456
1.5-7: syntax error: expected end of file or + or - or * or / or ^ before number
    1 | 123 456
      |     ^~~
It supports debug traces with semantic values.
It uses named references instead of the traditional $1, $2, etc.

Reset the C/C++ preprocessor #line to the physical file/line

I have a code generator that's going to take some user-written code and embed chunks of it in a larger generated file. I want the underlying compiler to provide good diagnostics when there are defects in the user's code, but I also don't want defects in the generated code to be misattributed to the source when they shouldn't be.
I intend to emit #line lineNum "sourceFile" directives at the beginning of each chunk of user-written code. However, I can't find any documentation of the #line directive that mentions a technique for 'resetting' __LINE__ and __FILE__ back to the actual line in the generated file once I leave the user-provided code. The ideal solution would be analogous to the C# preprocessor's #line default directive.
Do I just need to keep track of how many lines I've written and manually reset that myself? Or is there a better way, some sort of reset directive or sentinel value I can pass to #line to erase the association with the user's code?
It looks like this may have been posed before, though there's no solid answer there. To distinguish this from that, I'll additionally ask whether the lack of answer there has changed with C++11.
A technique I've used before is to have my code generator output a # by itself on a line when it wants to reset the line directives, and then use a simple awk script to postprocess the file and change those to correct line directives:
#!/bin/awk -f
/^#$/ { printf "#line %d \"%s\"\n", NR+1, FILENAME; next; }
{ print; }
Yes, you need to keep track of the number of lines you've output, and you need to know the name of the file you're outputting into. Remember that the line number you specify is the line number of the next line. So if you've written 12 lines so far, you need to output #line 14 "filename", since the #line directive will go on line 13, and so the next line is 14.
There's no difference between the #line preprocessor directive in C and C++.
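The off-by-two arithmetic is easy to get wrong, so here is a minimal sketch of a helper a generator could call when it leaves a chunk of user code. The name format_line_reset and its parameters are made up for illustration; lines_written is the number of lines already emitted into the generated file:

```c
#include <stdio.h>

/*
 * Build the directive that points diagnostics back at the generated file.
 * The directive itself will occupy line lines_written + 1, and #line names
 * the line that FOLLOWS it, hence lines_written + 2.
 */
int format_line_reset(char *buf, size_t size, int lines_written,
                      const char *generated_name)
{
    return snprintf(buf, size, "#line %d \"%s\"\n",
                    lines_written + 2, generated_name);
}
```

With 12 lines written so far, this produces #line 14 "filename", matching the example above. The generator still has to count the directive itself as one more written line afterwards.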
Suppose the input to the code generator, "user.code", contains the following:
int foo () {
return error1 ();
}
int bar () {
return error2 ();
}
Suppose you want to augment this so that it looks basically like this:
int foo () {
return error1 ();
}
int generated_foo () {
return generated_error1 ();
}
int bar () {
return error2 ();
}
int generated_bar () {
return generated_error2 ();
}
Except you don't want that. You want to add #line directives to the generated code so that the compiler messages indicate whether the errors / warnings are from the user code or the autogenerated code. The #line directive indicates the source of the next line of code (rather than the line containing the #line directive).
#line 1 "user.code"
int foo () {
return error1 ();
}
#line 7 "generated_code.cpp" // NOTE: This is line #6 of generated_code.cpp
int generated_foo () {
return generated_error1 ();
}
#line 5 "user.code"
int bar () {
return error2 ();
}
#line 17 "generated_code.cpp" // NOTE: This is line #16 of generated_code.cpp
int generated_bar () {
return generated_error2 ();
}
@Novelocrat,
I had asked this question here before, and no solid answers were posted, but I figured out that if line directives that point to the user code are inserted in the auto-generated code, the auto-generated code becomes hard to relocate. You have to keep the auto-generated and user code in locations where the compiler can find them for reporting errors. I thought it was better to simply insert the file name and line numbers of the user code in the generated code. In good text editors it is a matter of a couple of keystrokes to jump to a line in a file by placing the cursor on the file name.
E.g., in vim, placing the cursor on the file name and pressing gf takes you to the file, and :42 takes you to line 42 (say) that had the error.
Just posting this bit here, so that someone else coming up with the same questions might consider this alternative too.
Have you tried what __LINE__ and __FILE__ give you? I believe they are taken from your #line directives (what would be the point if not?).
(A quick test with gcc-4.7.2 and clang-3.1 confirms my hunch).

"Conditional" parsing of command-line arguments

Say I have an executable (running on mac, win, and linux)
a.out [-a] [-b] [-r -i <file> -o <file> -t <double> -n <int> ]
where an argument in [ ] means that it is optional. However, if the last argument -r is set then -i,-o,-t, and -n have to be supplied, too.
There are lots of good C++-libraries out there to parse command-line arguments, e.g. gflags (http://code.google.com/p/gflags/), tclap (http://tclap.sourceforge.net/), simpleopt(http://code.jellycan.com/simpleopt/), boost.program_options (http://www.boost.org/doc/libs/1_52_0/doc/html/program_options.html), etc. But I wondered if there is one that lets me encode these conditional relationships between arguments directly, w/o manually coding error handling
if ( argR.isSet() && ( ! argI.isSet() || ! argO.isSet() || ... ) ) ...
and manually setting up the --help.
The library tclap allows to XOR arguments, e.g. either -a or -b is allowed but not both. So, in that terminology an AND for arguments would be nice.
Does anybody know a versatile, light-weight, and cross-platform library that can do that?
You could make two passes over the arguments: if -r is among the options, you reset the parser and start over with the new mandatory options added.
You could also look into how the TCLAP XorHandler works, and create your own AndHandler.
You could change the argument syntax so that -r takes four values in a row.
I have a snippet of TCLAP code lying around that seems to fit the error-handling portion of what you're looking for, although it doesn't match it exactly:
#include <string>
#include "tclap/CmdLine.h"
namespace TCLAP {
class RequiredDependentArgException : public ArgException {
public:
    /**
     * Constructor.
     * \param parentArg - The argument whose presence makes requiredArg mandatory.
     * \param requiredArg - The dependent argument that is missing.
     */
    RequiredDependentArgException(
        const TCLAP::Arg& parentArg,
        const TCLAP::Arg& requiredArg)
    : ArgException(
        std::string("Required argument ") +
        requiredArg.toString() +
        std::string(" missing when the ") +
        parentArg.toString() +
        std::string(" flag is specified."),
        requiredArg.toString())
    { }
};
} // namespace TCLAP
And then make use of the new exception after TCLAP::CmdLine::parse has been called:
if (someArg.isSet() && !conditionallyRequiredArg.isSet()) {
throw(TCLAP::RequiredDependentArgException(someArg, conditionallyRequiredArg));
}
I remember looking in to extending and adding an additional class that would handle this logic, but then I realized the only thing I was actually looking for was nice error reporting because the logic wasn't entirely straightforward and couldn't be easily condensed (at least, not in a way that was useful to the next poor guy who came along). A contrived scenario dissuaded me from pursuing it further, something to the effect of, "if A is true, B must be set but C can't be set if D is of value N." Expressing such things in native C++ is the way to go, especially when it comes time to do very strict argument checks at CLI arg parse time.
For truly pathological cases and requirements, create a state machine using something like Boost.MSM (Meta State Machine). HTH.
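For the simple "-r requires -i, -o, -t and -n" case, the hand-written check can at least be made table-driven, independent of whichever parsing library fills in the flags. This is only a sketch in plain C; struct opt_state and first_missing are hypothetical names, and is_set is assumed to be copied from the parser's results after parsing:

```c
#include <stddef.h>

/* One parsed option: its spelling and whether it appeared on the command line. */
struct opt_state {
    const char *name;   /* e.g. "-i" */
    int is_set;         /* nonzero if the option was given */
};

/*
 * If parent is set, return the name of the first dependent option that is
 * missing; return NULL when parent is unset or all dependents are present.
 */
const char *first_missing(const struct opt_state *parent,
                          const struct opt_state *deps, size_t ndeps)
{
    if (!parent->is_set)
        return NULL;
    for (size_t i = 0; i < ndeps; i++) {
        if (!deps[i].is_set)
            return deps[i].name;
    }
    return NULL;
}
```

A non-NULL result is the place to raise an error such as the exception shown above, naming both the parent and the missing option.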
If you just want to parse a command line, you can use SimpleOpt. Download it from:
https://code.google.com/archive/p/simpleopt/downloads
Test:
int _tmain(int argc, TCHAR * argv[])
where argv can be, for example: 1.txt 2.txt *.cpp

Compile a program with local file embedded as a string variable?

Question should say it all.
Let's say there's a local file "mydefaultvalues.txt", separated from the main project. In the main project I want to have something like this:
char * defaultvalues = " ... "; // here should be the contents of mydefaultvalues.txt
And let the compiler swap " ... " with the actual contents of mydefaultvalues.txt. Can this be done? Is there like a compiler directive or something?
Not exactly, but you could do something like this:
defaults.h:
#define DEFAULT_VALUES "something something something"
code.c:
#include "defaults.h"
char *defaultvalues = DEFAULT_VALUES;
Where defaults.h could be generated, or otherwise created however you were planning to do it. The pre-processor can only do so much. Making your files in a form that it will understand will make things much easier.
The trick I did, on Linux, was to have in the Makefile this line:
defaultvalues.h: defaultvalues.txt
	xxd -i defaultvalues.txt > defaultvalues.h
Then you could include:
#include "defaultvalues.h"
This defines both unsigned char defaultvalues_txt[], with the contents of the file, and unsigned int defaultvalues_txt_len, with the size of the file.
Note that defaultvalues_txt is not null-terminated, thus, not considered a C string. But since you also have the size, this should not be a problem.
EDIT:
A small variation would allow me to have a null-terminated string:
echo "char defaultvalues[] = { " `xxd -i < defaultvalues.txt` ", 0x00 };" > defaultvalues.h
Obviously will not work very well if the null character is present inside the file defaultvalues.txt, but that won't happen if it is plain text.
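If you would rather keep the plain xxd -i output, the array can also be turned into a C string at run time. A minimal sketch, where the two definitions at the top stand in for a generated defaultvalues.h (here holding the four bytes of "a=1\n") and embedded_to_string is a made-up helper:

```c
#include <stdlib.h>
#include <string.h>

/* Stand-ins for what the generated header would define. */
static unsigned char defaultvalues_txt[] = { 0x61, 0x3d, 0x31, 0x0a };
static unsigned int defaultvalues_txt_len = 4;

/*
 * Return a freshly allocated, NUL-terminated copy of the embedded bytes.
 * The caller frees the result; NULL means the allocation failed.
 */
char *embedded_to_string(const unsigned char *data, unsigned int len)
{
    char *s = malloc((size_t)len + 1);
    if (s == NULL)
        return NULL;
    memcpy(s, data, len);
    s[len] = '\0';
    return s;
}
```

Calling embedded_to_string(defaultvalues_txt, defaultvalues_txt_len) once at startup gives a string that can be passed to the usual str* functions.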
One way to achieve compile-time trickery like this is to write a simple script in some interpreted programming language (e.g. Python, Ruby or Perl will do great) which does a simple search and replace. Then just run the script before compiling.
Define your own #pragma XYZ directive, which the script looks for and replaces with code that declares the variable with the file contents in a string.
char * defaultvalues = ...
where ... contains the text string read from the given text file. Be sure to compensate for line length, new lines, string formatting characters and other special characters.
Edit: lvella beat me to it with a far superior approach: embrace the tools your environment supplies you. In this case, use a tool which does the string search and replace, and feed the file to it.
Late answer I know but I don't think any of the current answers address what the OP is trying to accomplish although zxcdw came really close.
All any 7 year old has to do is load your program into a hex editor and hit CTRL-S. If the text is in your executable code (or vicinity) or application resource they can find it and edit it.
If you want to prevent the general public from changing a resource or static data just encrypt it, stuff it in a resource then decrypt it at runtime. Try DES for something small to start with.

lex (flex) generated program not parsing whole input

I have a relatively simple lex/flex file and have been running it with flex's debug flag to make sure it's tokenizing properly. Unfortunately, I always run into one of two problems: either the program that flex generates silently gives up after a couple of tokens, or the rule I'm using to recognize characters and strings isn't called and the default rule is called instead.
Can someone point me in the right direction? I've attached my flex file and sample input / output.
Edit: I've found that the generated lexer stops after a specific rule: "cdr". This is more detailed, but also much more confusing. I've posted a shortened, modified lex file.
/* lex file*/
%option noyywrap
%option nodefault
%{
enum tokens{
CDR,
CHARACTER,
SET
};
%}
%%
"cdr" { return CDR; }
"set" { return SET; }
[ \t\r\n] /*Nothing*/
[a-zA-Z0-9\\!@#$%^&*()\-_+=~`:;"'?<>,\.] { return CHARACTER; }
%%
Sample input:
set c cdra + cdr b + () ;
Complete output from running the input through the generated parser:
--(end of buffer or a NUL)
--accepting rule at line 16 ("set")
--accepting rule at line 18 (" ")
--accepting rule at line 19 ("c")
--accepting rule at line 18 (" ")
--accepting rule at line 15 ("cdr")
Any thoughts? The generated program is giving up after half of the input! (for reference, I'm doing input by redirecting the contents of a file to the generated program).
When generating a standalone lexer (that is, one whose tokens are not defined in bison/yacc), you typically write an enum at the top of the file defining your tokens. However, the main loop of a lex program, including the main loop generated by default, looks something like this:
while( token = yylex() ){
...
This is fine until your lexer matches the rule whose token appears first in the enum, in this specific case CDR. Since enums start at zero by default, yylex() returns 0 for that token and the while loop ends. Renumbering your enum will solve the issue:
enum tokens{
CDR = 1,
CHARACTER,
SET
};
Short version: when defining tokens by hand for a lexer, start with 1 not 0.
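The effect can be reproduced without flex. The sketch below replaces yylex() with a stand-in that replays a canned token stream and returns 0 at end of input; fake_yylex, count_tokens and the -1 terminator are made up for this illustration:

```c
#include <stddef.h>

enum zero_based { Z_CDR, Z_CHARACTER, Z_SET };        /* Z_CDR == 0 */
enum one_based  { O_CDR = 1, O_CHARACTER, O_SET };    /* no token is 0 */

static const int *cursor;   /* position in the canned token stream */

/* Stand-in for yylex(): next canned token, or 0 once the -1 marker is hit. */
static int fake_yylex(void)
{
    if (*cursor == -1)
        return 0;
    return *cursor++;
}

/* Run the usual driver loop and report how many tokens it consumed. */
int count_tokens(const int *stream)
{
    int token;
    int seen = 0;

    cursor = stream;
    while ((token = fake_yylex()) != 0)
        seen++;
    return seen;
}
```

Feeding it the stream SET, CHARACTER, CDR, CHARACTER stops after two tokens with the zero-based enum, because CDR is indistinguishable from end of input, while the one-based enum lets the loop see all four.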
This rule
[-+]?([0-9*\.?[0-9]+|[0-9]+\.)([Ee][-+]?[0-9]+)?
          |
seems to be missing a closing bracket just after the first 0-9; I added a | below where I think it should be. I couldn't begin to guess how flex would respond to that.
The character class I usually use for symbol names is [a-zA-Z$_]. This is like your unquoted strings, except that I usually allow numbers inside symbols as long as the symbol doesn't start with a number:
[a-zA-Z$_]([a-zA-Z$_]|[0-9])*
A character is just a short symbol. I don't think it needs to have its own rule, but if it does, then you need to ensure that the string rule requires at least 2 characters:
[a-zA-Z$_]([a-zA-Z$_]|[0-9])+
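To check what the symbol pattern accepts without running flex, the same rule can be spelled out by hand. This is a sketch of [a-zA-Z$_]([a-zA-Z$_]|[0-9])* only; is_symbol and is_symbol_start are made-up names, and isalpha is assumed to run in the default "C" locale, where it matches exactly a-z and A-Z:

```c
#include <ctype.h>
#include <stddef.h>

/* First-character class: [a-zA-Z$_] */
static int is_symbol_start(char c)
{
    return isalpha((unsigned char)c) || c == '$' || c == '_';
}

/* Whole-string match for [a-zA-Z$_]([a-zA-Z$_]|[0-9])* ; returns 1 or 0. */
int is_symbol(const char *s)
{
    if (!is_symbol_start(s[0]))
        return 0;                   /* empty, or starts with a digit/other */
    for (size_t i = 1; s[i] != '\0'; i++) {
        if (!is_symbol_start(s[i]) && !isdigit((unsigned char)s[i]))
            return 0;
    }
    return 1;
}
```

Requiring at least two characters, as in the last pattern, would only need an extra check that s[1] is not the terminator.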