Ignoring Syntax Errors - c++

I have a file that contains an arbitrary number of lines of c++ code, each line of which is self-contained (meaning it is valid by itself in the main function). However, I do not know how many, if any, of the lines will have valid c++ syntax. An example file might be
int length, width; // This one is fine
template <class T className {}; // Throws a syntax error
What I want to do is write to a second file all the lines that have valid syntax. Currently, I've written a program in python that reads each line, places it into the following form
int main() {
// Line goes here
return 0;
}
and attempts to compile it, returning True if the compilation succeeds and False if it doesn't, which I then use to determine which lines to write to the output file. For example, the first line would generate a file containing
int main() {
int length, width;
return 0;
}
which would compile fine and return True to the python program. However, I'm curious if there is any sort of try-catch syntax that works with the compiler so I could put each line of the file in a try-catch block and write it to the output if no exception is thrown, or if there's a way I can tell the compiler to ignore syntax errors.
Edit: I've been asked for details about why I would need to do this, and I'll be the first to admit it's a strange question. The reason I'm doing this is because I have another program (of which I don't know all the implementation details) that writes a large number of lines to a file, each of which should be able to stand alone. I also know that this program will almost certainly write lines that have syntax errors. What I'm trying to do is write a program that will remove any invalid lines so that the resulting file can compile without error. What I have in my python program right now works, but I'm trying to figure out if there is a simpler way to do it.
Edit 2: Though I think I've got my answer - that I can't really play try-catch with the compiler, and that's good enough. Thanks everyone!

Individual lines of code that are syntactically correct in the context of a C++ source file are not necessarily syntactically correct by themselves.
For example this:
int length, width;
happens to be valid either as part of a main function or by itself -- but it has a different meaning (by itself it defines length and width as static objects).
This:
}
is valid in context, but not by itself.
There is typically no way for a compiler to ignore syntax errors. Once a syntax error has been encountered, the compiler has no way to interpret the rest of the code.
When you're reading English text, adfasff iyufoyur; ^^$(( -- but you can usually recover and recognize valid syntax after an error. Compilers for programming languages aren't designed to perform that kind of recovery; probably the nature of C++'s syntax would make it more difficult, and there's just not enough demand to make it worth doing.
I'm not sure what your criterion for a single line of code being "correct" is. One possibility might be to write the line of code to a file, contained in a definition of main:
int main() {
// insert arbitrary line here
}
and then compile the resulting source file. I'm not sure that I can see how that would be particularly useful, but it's about the closest I can come to what you're asking for.

What do you mean by "each line is self-contained"? If the syntax of a line of C++ code is valid may depend largely on the code before or after that line. A given line of code might be valid within a function, but not outside a function body. So, as long as you can't define what you mean by "self-contained" it is hard to solve your problem.

Related

Function to call #include macro from a string variable argument?

Is it possible to have a function like this:
const char* load(const char* filename_){
return
#include filename_
;
};
so you wouldn't have to hardcode the #include file?
Maybe with a some macro?
I'm drawing a blank, guys. I can't tell if it's flat out not possible or if it just has a weird solution.
EDIT:
Also, the ideal is to have this as a compile time operation, otherwise I know there's more standard ways to read a file. Hence thinking about #include in the first place.
This is absolutely impossible.
The reason is - as Justin already said in a comment - that #include is evaluated at compile time.
To include files during run time would require a complete compiler "on board" of the program. A lot of script languages support things like that, but C++ is a compiled language and works different: Compile and run time are strictly separated.
You cannot use #include to do what you want to do.
The C++ way of implementing such a function is:
Find out the size of the file.
Allocate memory for the contents of the file.
Read the contents of the file into the allocated memory.
Return the contents of the file to the calling function.
It will better to change the return type to std::string to ease the burden of dealing with dynamically allocated memory.
std::string load(const char* filename)
{
std::string contents;
// Open the file
std::ifstream in(filename);
// If there is a problem in opening the file, deal with it.
if ( !in )
{
// Problem. Figure out what to do with it.
}
// Move to the end of the file.
in.seekg(0, std::ifstream::end);
auto size = in.tellg();
// Allocate memory for the contents.
// Add an additional character for the terminating null character.
contents.resize(size+1);
// Rewind the file.
in.seekg(0);
// Read the contents
auto n = in.read(contents.data(), size);
if ( n != size )
{
// Problem. Figure out what to do with it.
}
contents[size] = '\0';
return contents;
};
PS
Using a terminating null character in the returned object is necessary only if you need to treat the contents of the returned object as a null terminated string for some reason. Otherwise, it maybe omitted.
I can't tell if it's flat out not possible
I can. It's flat out not possible.
Contents of the filename_ string are not determined until runtime - the content is unknown when the pre processor is run. Pre-processor macros are processed before compilation (or as first step of compilation depending on your perspective).
When the choice of the filename is determined at runtime, the file must also be read at runtime (for example using a fstream).
Also, the ideal is to have this as a compile time operation
The latest time you can affect the choice of included file is when the preprocessor runs. What you can use to affect the file is a pre-processor macro:
#define filename_ "path/to/file"
// ...
return
#include filename_
;
it is theoretically possible.
In practice, you're asking to write a PHP construct using C++. It can be done, as too many things can, but you need some awkward prerequisites.
a compiler has to be linked into your executable. Because the operation you call "hardcoding" is essential for the code to be executed.
a (probably very fussy) linker again into your executable, to merge the new code and resolve any function calls etc. in both directions.
Also, the newly imported code would not be reachable by the rest of the program which was not written (and certainly not compiled!) with that information in mind. So you would need an entry point and a means of exchanging information. Then in this block of information you could even put pointers to code to be called.
Not all architectures and OSes will support this, because "data" and "code" are two concerns best left separate. Code is potentially harmful; think of it as nitric acid. External data is fluid and slippery, like glycerine. And handling nitroglycerine is, as I said, possible. Practical and safe are something completely different.
Once the prerequisites were met, you would have two or three nice extra functions and could write:
void *load(const char* filename, void *data) {
// some "don't load twice" functionality is probably needed
void *code = compile_source(filename);
if (NULL == code) {
// a get_last_compiler_error() would be useful
return NULL;
}
if (EXIT_SUCCESS != invoke_code(code, data)) {
// a get_last_runtime_error() would also be useful
release_code(code);
return NULL;
}
// it is now the caller's responsibility to release the code.
return code;
}
And of course it would be a security nightmare, with source code left lying around and being imported into a running application.
Maintaining the code would be a different, but equally scary nightmare, because you'd be needing two toolchains - one to build the executable, one embedded inside said executable - and they wouldn't necessarily be automatically compatible. You'd be crying loud for all the bugs of the realm to come and rejoice.
What problem would be solved?
Implementing require_once in C++ might be fun, but you thought it could answer a problem you have. Which is it exactly? Maybe it can be solved in a more C++ish way.
A better alternative, considering also performances etc., to compile a loadable module beforehand, and load it at runtime.
If you need to perform small tunings to the executable, place parameters into an external configuration file and provide a mechanism to reload it. Once the modules conform to a fixed specification, you can even provide "plugins" that weren't available when the executable was first developed.

What is the right way to parse large amount of input for main function in C++

Suppose there are 30 numbers I had to input into an executable, because of the large amount of input, it is not reasonable to input them via command line. One standard way is to save them into a single XML file and use XML parser like tinyxml2 to parse them. The problem is if I use tinyxml2 to parse the input directly I will have a very bloated main function, which seems to contradict the common good practice.
For example:
int main(int argc, char **argv){
int a[30];
tinyxml2::XMLDocument doc_xml;
if (doc_xml.LoadFile(argv[1])){
std::cerr << "failed to load input file";
}
else {
tinyxml2::XMLHandle xml(&doc_xml);
tinyxml2::XMLHandle a0_xml =
xml.FirstChildElement("INPUT").FirstChildElement("A0");
if (a0_xml.ToElement()) {
a0_xml.ToElement()->QueryIntText(&a[0]);
}
else {
std::cerr << "A0 missing";
}
tinyxml2::XMLHandle a1_xml =
xml.FirstChildElement("INPUT").FirstChildElement("A1");
if (a1_xml.ToElement()) {
a1_xml.ToElement()->QueryIntText(&a[1]);
}
else {
std::cerr << "A1 missing";
}
// parsing all the way to A29 ...
}
// do something with a
return 0;
}
But on the other hand, if I write an extra class just to parse these specific type of input in order to shorten the main function, it doesn't seem to be right either, because this extra class will be useless unless it's used in conjunction with this main function since it can't be reused elsewhere.
int main(int argc, char **argv){
int a[30];
ParseXMLJustForThisExeClass ParseXMLJustForThisExeClass_obj;
ParseXMLJustForThisExeClass_obj.Run(argv[1], a);
// do something with a
return 0;
}
What is the best way to deal with it?
Note, besides reading XML files you can also pass lots of data through stdin. It's pretty common practice to use e.g. mycomplexcmd | hexdump -C, where hexdump is reading from stdin through the pipe.
Now up to the rest of the question: there's a reason to go with the your multiple-functions example (here it's not very important whether they're constructors or usual functions). It's pretty much the same as why would you want any function to be smaller — readability. That said, I don't know about the "common good practice", and I've seen many terminal utilities with very big main().
Imagine someone new is reading 1-st variant of main(). They'd be going through the hoops of figuring out all these handles, queries, children, parents — when all they wanted is to just look at the part after // do something with a. It's because they don't know if it's relevant to their problem or not. But in the 2-nd variant they'll quickly figure it out "Aha, it's the parsing logic, it's not what I am looking for".
That said, of course you can break the logic with detailed comments. But now imagine something went wrong, someone is debugging the code, and they pinned down the problem to this function (alright, it's funny given the function is main(), maybe they just started debugging). The bug turned out to be very subtle, unclear, and one is checking everything in the function. Now, because you're dealing with mutable language, you'd often find yourself in situation where you think "oh, may be it's something with this variable, where it's being changed?"; and you first look up every use of the variable through this large function, then conditions that could lead to blocks where it's changed; then you figuring out what does this another big block, relevant to the condition, that could've been extracted to a separate function, what variables are used in there; and to the moment you figured out what it's doing you already forgot half of what you were looking before!
Of course sometimes big functions are unavoidable. But if you asked the question, it's probably not your case.
Rule of thumb: you see a function doing two different things having little in common, you want to break it to 2 separate functions. In your case it's parsing XML and "doing something with a". Though if that 2-nd part is a few lines, probably not worth extracting — speculate a bit. Don't worry about the overhead, compilers are good at optimizing. You can either use LTO, or you can declare a function in .cpp file only as static (non-class static), and depending on optimization options a compiler may inline the code.
P.S.: you seem to be in the state where it's very useful to learn'n'play with some Haskell. You don't have to use it for real serious projects, but insights you'd get can be applied anywhere. It forces you into better design, in particular you'd quickly start feeling when it's necessary to break a function (aside of many other things).

Removal of unused or redundant code [duplicate]

This question already has answers here:
Listing Unused Symbols
(2 answers)
Closed 7 years ago.
How do I detect function definitions which are never getting called and delete them from the file and then save it?
Suppose I have only 1 CPP file as of now, which has a main() function and many other function definitions (function definition can also be inside main() ). If I were to write a program to parse this CPP file and check whether a function is getting called or not and delete if it is not getting called then what is(are) the way(s) to do it?
There are few ways that come to mind:
I would find out line numbers of beginning and end of main(). I can do it by maintaining a stack of opening and closing braces { and }.
Anything after main would be function definition. Then I can parse for function definitions. To do this I can parse it the following way:
< string >< open paren >< comma separated string(s) for arguments >< closing paren >
Once I have all the names of such functions as described in (2), I can make a map with its names as key and value as a bool, indicating whether a function is getting called once or not.
Finally parse the file once again to check for any calls for functions with their name as in this map. The function call can be from within main or from some other function. The value for the key (i.e. the function name) could be flagged according to whether a function is getting called or not.
I feel I have complicated my logic and it could be done in a smarter way. With the above logic it would be hard to find all the corner cases (there would be many). Also, there could be function pointers to make parsing logic difficult. If that's not enough, the function pointers could be typedefed too.
How do I go about designing my program? Are a map (to maintain filenames) and stack (to maintain braces) the right data structures or is there anything else more suitable to deal with it?
Note: I am not looking for any tool to do this. Nor do I want to use any library (if it exists to make things easy).
I think you should not try to build a C++ parser from scratch, becuse of other said in comments that is really hard. IMHO, you'd better start from CLang libraries, than can do the low-level parsing for you and work directly with the abstract syntax tree.
You could even use crange as an example of how to use them to produce a cross reference table.
Alternatively, you could directly use GNU global, because its gtags command directly generates definition and reference databases that you have to analyse.
IMHO those two ways would be simpler than creating a C++ parser from scratch.
The simplest approach for doing it yourself I can think of is:
Write a minimal parser that can identify functions. It just needs to detect the start and ending line of a function.
Programmatically comment out the first function, save to a temp file.
Try to compile the file by invoking the complier.
Check if there are compile errors, if yes, the function is called, if not, it is unused.
Continue with the next function.
This is a comment, rather than an answer, but I post it here because it's too long for a comment space.
There are lots of issues you should consider. First of all, you should not assume that main() is a first function in a source file.
Even if it is, there should be some functions header declarations before the main() so that the compiler can recognize their invocation in main.
Next, function's opening and closing brace needn't be in separate lines, they also needn't be the only characters in their lines. Generally, almost whole C++ code can be put in a single line!
Furthermore, functions can differ with parameters' types while having the same name (overloading), so you can't recognize which function is called if you don't parse the whole code down to the parameters' types. And even more: you will have to perform type lists matching with standard convertions/casts, possibly considering inline constructors calls. Of course you should not forget default parameters. Google for resolving overloaded function call, for example see an outline here
Additionally, there may be chains of unused functions. For example if a() calls b() and b() calls c() and d(), but a() itself is not called, then the whole four is unused, even though there exist 'calls' to b(), c() and d().
There is also a possibility that functions are called through a pointer, in which case you may be unable to find a call. Example:
int (*testfun)(int) = whattotest ? TestFun1 : TestFun2; // no call
int testResult = testfun(paramToTest); // unknown function called
Finally the code can be pretty obfuscated with #define–s.
Conclusion: you'll probably have to write your own C++ compiler (except the machine code generator) to achieve your goal.
This is a very rough idea and I doubt it's very efficient but maybe it can help you get started. First traverse the file once, picking out any function names (I'm not entirely sure how you would do this). But once you have those names, traverse the file again, looking for the function name anywhere in the file, inside main and other functions too. If you find more than 1 instance it means that the function is being called and should be kept.

Parsing C++ to make some changes in the code

I would like to write a small tool that takes a C++ program (a single .cpp file), finds the "main" function and adds 2 function calls to it, one in the beginning and one in the end.
How can this be done? Can I use g++'s parsing mechanism (or any other parser)?
If you want to make it solid, use clang's libraries.
As suggested by some commenters, let me put forward my idea as an answer:
So basically, the idea is:
... original .cpp file ...
#include <yourHeader>
namespace {
SpecialClass specialClassInstance;
}
Where SpecialClass is something like:
class SpecialClass {
public:
SpecialClass() {
firstFunction();
}
~SpecialClass() {
secondFunction();
}
}
This way, you don't need to parse the C++ file. Since you are declaring a global, its constructor will run before main starts and its destructor will run after main returns.
The downside is that you don't get to know the relative order of when your global is constructed compared to others. So if you need to guarantee that firstFunction is called
before any other constructor elsewhere in the entire program, you're out of luck.
I've heard the GCC parser is both hard to use and even harder to get at without invoking the whole toolchain. I would try the clang C/C++ parser (libparse), and the tutorials linked in this question.
Adding a function at the beginning of main() and at the end of main() is a bad idea. What if someone calls return in the middle?.
A better idea is to instantiate a class at the beginning of main() and let that class destructor do the call function you want called at the end. This would ensure that that function always get called.
If you have control of your main program, you can hack a script to do this, and that's by far the easiet way. Simply make sure the insertion points are obvious (odd comments, required placement of tokens, you choose) and unique (including outlawing general coding practices if you have to, to ensure the uniqueness you need is real). Then a dumb string hacking tool to read the source, find the unique markers, and insert your desired calls will work fine.
If the souce of the main program comes from others sources, and you don't have control, then to do this well you need a full C++ program transformation engine. You don't want to build this yourself, as just the C++ parser is an enormous effort to get right. Others here have mentioned Clang and GCC as answers.
An alternative is our DMS Software Reengineering Toolkit with its C++ front end. DMS, using its C++ front end, can parse code (for a variety of C++ dialects), builds ASTs, carry out full name/type resolution to determine the meaning/definition/use of all symbols. It provides procedural and source-to-source transformations to enable changes to the AST, and can regenerate compilable source code complete with original comments.

finding a function name and counting its LOC

So you know off the bat, this is a project I've been assigned. I'm not looking for an answer in code, but more a direction.
What I've been told to do is go through a file and count the actual lines of code while at the same time recording the function names and individual lines of code for the functions. The problem I am having is determining a way when reading from the file to determine if the line is the start of a function.
So far, I can only think of maybe having a string array of data types (int, double, char, etc), search for that in the line and then search for the parenthesis, and then search for the absence of the semicolon (so i know it isn't just the declaration of the function).
So my question is, is this how I should go about this, or are there other methods in which you would recommend?
The code in which I will be counting will be in C++.
Three approaches come to mind.
Use regular expressions. This is fairly similar to what you're thinking of. Look for lines that look like function definitions. This is fairly quick to do, but can go wrong in many ways.
char *s = "int main() {"
is not a function definition, but sure looks like one.
char
* /* eh? */
s
(
int /* comment? // */ a
)
// hello, world /* of confusion
{
is a function definition, but doesn't look like one.
Good: quick to write, can work even in the face of syntax errors; bad: can easily misfire on things that look like (or fail to look like) the "normal" case.
Variant: First run the code through, e.g., GNU indent. This will take care of some (but not all) of the misfires.
Use a proper lexer and parser. This is a much more thorough approach, but you may be able to re-use an open source lexer/parsed (e.g., from gcc).
Good: Will be 100% accurate (will never misfire). Bad: One missing semicolon and it spews errors.
See if your compiler has some debug output that might help. This is a variant of (2), but using your compiler's lexer/parser instead of your own.
Your idea can work in 99% (or more) of the cases. Only a real C++ compiler can do 100%, in which case I'd compile in debug mode (g++ -S prog.cpp), and get the function names and line numbers from the debug information of the assembly output (prog.s).
My thoughts for the 99% solution:
Ignore comments and strings.
Document that you ignore preprocessor directives (#include, #define, #if).
Anything between a toplevel { and } is a function body, except after typedef, class, struct, union, namespace and enum.
If you have a class, struct or union, you should be looking for method bodies inside it.
The function name is sometimes tricky to find, e.g. in long(*)(char) f(int); .
Make sure your parser works with template functions and template classes.
For recording function names I use PCRE and the regex
"(?<=[\\s:~])(\\w+)\\s*\\([\\w\\s,<>\\[\\].=&':/*]*?\\)\\s*(const)?\\s*{"
and then filter out names like "if", "while", "do", "for", "switch". Note that the function name is (\w+), group 1.
Of course it's not a perfect solution but a good one.
I feel manually doing the parsing is going to be a quite a difficult task. I would probably use a existing tool such as RSM redirect the output to a csv file (assuming you are on windows) and then parse the csv file to gather the required information.
Find a decent SLOC count program, eg, SLOCCounter. Not only can you count SLOC, but you have something against which to compare your results. (Update: here's a long list of them.)
Interestingly, the number of non-comment semicolons in a C/C++ program is a decent SLOC count.
How about writing a shell script to do this? An AWK program perhaps.