I've been helping augment a twenty-some-year-old proprietary language within my company. It is a large, Turing-complete language. Translating it to another grammar regime (such as ANTLR) is not an option (I don't get to decide this).
For the most part, extending the grammar has gone smoothly. But every once in a while I'll get a reduce-reduce or shift-reduce conflict that:
is difficult to eliminate, or
sometimes just doesn't make sense (to my feeble brain).
After a lot of painful staring at y.output files and experimental grammar refactorings, I've usually gotten where I wanted to go. Sometimes I've had to make unsatisfactory compromises.
So, are there any tools out there that can suck in a yacc grammar, enhance browsing and experimenting, and allow debugging of changes?
If I add a production, I'd like to see more than "atomic production that is used everywhere" (think identifier) "conflicts with rule foo" (yes, there is more info, s/r, r/r, than that, but I think you get my drift). It would be nice to have some hint of the interplay beyond putting on my thinking cap and trying to imagine a symbol stack and state machine.
Update: I guess I should clarify. We use Berkeley Yacc. I have been testing using a recent version of Bison. For output, I've compiled the grammar with --report=itemset.
My goal with this post is to seek out external tools that augment the grammar-debugging facilities that ship with yacc. It's painful today with the default set. Help me find better interactive tools, such as those you can use with ANTLR.
You might get some help from yacc -d, which produces debugging output -- it basically gives a full listing of the symbol stack states and such. The output is dense and voluminous, so trying to read all of it directly rarely accomplishes much (never has for me anyway). However, when you make a change that gives (for example) an r/r conflict, you can run yacc -d on the old grammar and the new one, then run diff on the results, to get a much more detailed run-down on what change(s) caused the conflict.
It's probably worth noting, however, that s/r conflicts are often benign -- unless you're fairly sure it's a problem, trying to "fix" it often isn't worthwhile. The same is not true with r/r conflicts though. While these are sometimes benign, it's comparatively rare.
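For example, the classic dangling else is a benign s/r conflict: yacc's default shift resolution gives exactly the language-defined behavior. A minimal sketch in C-family source (hypothetical code, just to illustrate the ambiguity):

    // Dangling else: the grammar permits two parses of the nested if below.
    // yacc reports a shift/reduce conflict and resolves it by shifting,
    // which binds the else to the NEAREST if -- exactly what C and C++
    // define, so the default resolution is already correct.
    int classify(bool a, bool b) {
        int x = 0;
        if (a)
            if (b)
                x = 1;
            else        // attaches to "if (b)", not "if (a)"
                x = 2;
        return x;
    }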
Edit: Oops -- sorry, that should be -v. You mention y.output, so you apparently already know how to do that part. The point is that you don't try to look at the y.output files directly, but do a diff between the one that came out cleanly and the one that didn't to get some detail about the actual conflict (without staring at 10 jillion lines of "stuff" that's just fine).
This is the best I got:
http://tldp.org/HOWTO/Lex-YACC-HOWTO-7.html
I have a large codebase written in HSP (see the Wikipedia article; think "BASIC", but Japanese).
By "large" I mean it has 151352 lines of code, 60 source files with total code size of 4.5 megabytes. Also, it has plenty of spaghetti code, no comments and badly needs refactoring. The good thing is that it has a lot of text messages, so not all of those lines represent actual program logic.
I'd like to convert this codebase to C++, while retaining my sanity. "I'd like" means that I'm not required to do it, but I'd strongly prefer to find a method to do it.
What's a good way to do it? Obviously, I can't just rewrite it all in C++ (well, I could in theory, but it would take up to 2 years, and I would introduce many bugs in the process), so I think a reasonable decision would be to implement a recompiler/preprocessor that would let me convert the source code into messy C++ (HSP is much simpler than C++, so it should be possible) and then start refactoring/documenting the result.
Unfortunately, I'm not entirely sure how to approach building the recompiler efficiently. While I know there are Lex/Yacc/Bison/Boost.Spirit, I haven't used them personally.
So, can you recommend a good way to perform such a conversion?
Any free tool ("free" as in "free beer") that is available on windows platform is allowed, as long as it doesn't affect license of original source code.
Yacc is targeted at efficiently handling more complex tasks, and it's complex to learn; I think it's overkill.
Spirit should be a better choice; if you already know it, go with it. Personally, I would use Prolog for this task.
Prolog has built-in syntax analysis, so-called DCG. For a language as simple as BASIC, I'm pretty sure there are no practical problems in the grammar, and modern Prologs (I'm thinking of SWI-Prolog, specifically) can handle complex character encodings in the source very well.
Also, in Prolog you could try to apply some naivety to unroll the spaghetti code. Doing this in general is a complex task, but it could be easy if you have just a small number of patterns repeated many times.
Pattern matching is key in such problems...
Well, if you really want to go this way and forget about the advice in the comments, you should probably have a good look at the OpenHSP compiler, and especially the codegen file:
http://dev.onionsoft.net/trac/browser/trunk/hspcmp/codegen.cpp
and also keep the token definitions in front of you:
http://dev.onionsoft.net/trac/browser/trunk/hspcmp/token.h
http://dev.onionsoft.net/trac/browser/trunk/hspcmp/token.cpp
It seems that HSP is not that complicated, and you can skip the AST step, though you could get good optimizations out of one. Don't forget also to prepare a C++ lib to embed your generated code in, so you can manage HSP oddities (like globals and dynamic typing).
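To give an idea of what that support lib might contain, here's a minimal sketch of a variant-backed value type for HSP's dynamic typing (all names are hypothetical, not part of OpenHSP):

    #include <stdexcept>
    #include <string>
    #include <utility>
    #include <variant>

    // Hypothetical support type for translated HSP code: HSP variables are
    // dynamically typed (int, double, or string), so generated C++ can
    // route everything through a variant-backed Value.
    struct Value {
        std::variant<int, double, std::string> v;

        Value(int i) : v(i) {}
        Value(double d) : v(d) {}
        Value(std::string s) : v(std::move(s)) {}

        int as_int() const {
            if (auto p = std::get_if<int>(&v)) return *p;
            if (auto p = std::get_if<double>(&v)) return static_cast<int>(*p);
            throw std::runtime_error("not a number");
        }
    };

    // HSP globals could become members of a single context object that the
    // generated code passes around, instead of real C++ globals.
    struct HspContext {
        Value score{0};
    };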
If you can hack something out of that, you'll also have to remove most of what this compiler does (creating the executable, linkage, and so on). Don't forget, it's a really long and hard task that may not be faster or easier than a full rewrite. But if you're ready, you'll find out the hard way :)
According to the original owner of the codebase, HSP starting with version 3 includes an HSP-to-C code converter. This information is not verified due to lack of time, but this blog article documents a tool called hspcnv which is supposed to convert HSP code into C code. The article is in Japanese.
If you've got a codebase which is a bit messy in respect to coding standards - a mix of different conventions from different people - is it reasonable to give one person the task of going through every file and bringing it up to meet standards?
As well as being tremendously dull, this will produce a mass of changes in SVN (or whatever) which can make comparing versions harder. Is it sensible to set someone on the whole codebase, or is it considered stupid to touch a file only to make it meet standards? Should files be left alone until some 'real' change is needed, and then updated?
Tagged as C++ since I think different languages have different automated tools for this.
Should files be left alone until some 'real' change is needed, and then updated?
This is what I would do.
Even if it's primarily text layout changes, doing it by a manual process on a large scale risks breaking code that was working.
Treat it as a refactor and do it locally whenever code has to be touched for some other reason. Add tests if they're missing to improve your chances of not breaking the code.
If your code is already well covered by tests, you might get away with something global, but I still wouldn't advocate it.
I also think this is pretty much language-agnostic.
It also depends on what kind of changes you are planning to make in order to bring it up to your coding standard. Everyone's definition of a coding standard is different.
More specifically:
Can your proposed changes be made to the project with a 100% guarantee that the entire project will behave identically to before? For example, changes that only affect comments, line breaks, and whitespace should be fine.
If you do not have a 100% guarantee, then there is a risk that should not be taken unless it can be balanced with a benefit. For example, is there a need to gain a deeper understanding of the current code base in order to continue its development, or fix its bugs? Is the jumble of coding conventions preventing these initiatives? If so, evaluate the costs and benefits and decide whether a makeover is justified.
If you need to understand the current code base, here is a technique: tracing.
Make a copy of the code base. Note that tracing involves adding code, so it should not be performed on the production copy.
In the new copy, insert many fprintf (trace) statements into any functions considered critical. It may be possible to automate this.
Run the project with various inputs and collect those tracing results. This will help everyone understand the current project's design.
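A minimal sketch of such a trace statement for a C++ codebase (the macro name is made up):

    #include <cstdio>

    // Hypothetical trace macro: logs file, line, and function on entry.
    // Writing to stderr keeps the trace separate from normal output.
    #define TRACE() std::fprintf(stderr, "TRACE %s:%d %s\n", \
                                 __FILE__, __LINE__, __func__)

    int compute_total(int a, int b) {
        TRACE();  // marks that this critical function was reached
        return a + b;
    }

    int main() {
        TRACE();
        return compute_total(2, 3) == 5 ? 0 : 1;
    }

Since the macro call always goes at the top of a function body, inserting it is mechanical enough that a script can often do the job.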
Another technique for understanding the current code base is to document the dependencies in the project.
Some kinds of dependencies (interface dependency, C++ include dependency, C++ typedef / identifier dependency) can be extracted by automated tools.
Run-time dependency can only be extracted through tracing, or by profiling tools.
I was thinking it's a task you might give a work-experience kid or put out onto RentaCoder.
This depends mainly on the codebase's size.
I've seen three trainees given the task to go through a 2MLoC codebase (several thousand source files) in order to insert one new line into the standard disclaimer at the top of all the source files (with the line's content depending on the file's name and path). It took them several days. One of the three used most of that time to write a script that would do it and later only fixed the files where the script had failed to insert the line correctly, the other two just ploughed through the files. (The one who wrote the script later got a job at that company.)
The job of manually adapting all those files in that codebase to certain coding standards would probably have to be measured in man-years.
OTOH, if it's just a few dozen files, it's certainly doable.
Your codebase is very likely somewhere in between, so your best bet might be to set a "work-experience kid" to find out whether there's a tool that can do this to your satisfaction and, if so, make it work.
Should files be left alone until some 'real' change is needed, and then updated?
I'd strongly advise against this. If you do this, you will have "real" changes intermingled with whatever reformatting took place, making it nigh impossible to see the "real" changes in the diff.
You can address the formatting aspect of coding style fairly easily. There are a number of tools that can auto-format your code. I recommend hooking one of these up to your version control tool's "check in" feature. This way, people can use whatever format they want while editing their code, but when it gets checked in, it's reformatted to the official style.
In general, I think it's best if you can do the big change all at once. In the past, we've done the following:
1. have a time dedicated to the reformatting when most people aren't working (e.g. at night or on the weekend)
2. have a person check out as many files as possible at that time, reformat them, and check them in again
With a reformatting-only revision, you don't have to figure out what has changed in addition to the formatting.
I got a task related to an ANCIENT C++ project which has no documentation or comments at all, and all the code/variable names are written in a foreign language. Do I have a chance to analyze this code in one working day and produce a design/UML from which to create new features? I have been sitting around for 3 hours already and I feel so frustrated... Maybe somebody has had the same problem? Any advice?
I suspect the biggest issue may be the fact that it's in a foreign language. You can use various static code analysis tools to try and understand what's going on, but if everything is presented in an unfamiliar language then that's still no use. Your first step (I believe) is to find someone who can speak this language and get them to translate as you go...
1) Use Doxygen. You can configure Doxygen to extract the code structure from undocumented source files.
2) Use Source Insight. Source Insight is an advanced code editor and browser with built-in analysis for C/C++, C#, and Java programs.
Short answer, no - you probably don't have a chance to understand the code in one day. Reading/maintaining code is one of the hardest things to do, especially when it's lacking documentation. The fact that the code is in a foreign language (!) makes it even harder.
Sounds like you are on a very restricted (unrealistic) time-budget, but Working With Legacy Software is a good book if you're working with legacy systems. If you are planning to keep adding new features to the legacy system it's your responsibility to make your management aware of the scope of the operation. Or at least try.
Under this time constraint (1 day) it may or may not be doable depending on the size of the project - if it's a few hundred lines of code then for sure. If it's a serious project with several tens of thousands of lines, then likely no.
The first thing you need to know is what this program is supposed to do at all. If you have no idea what it does and how it does it, then analyzing the code will give you the answer, but it will be a long and frustrating task. So my first suggestion would be to get yourself familiar with the outer workings of the software - what is it supposed to do, and generally how is it supposed to do it. If you are doing this as part of your work then you should be able to get someone to walk you through using the program - even if its UI is in a foreign language (which I hope it isn't, even if the code is written by a foreign-language speaker).
Once you know what the software is attempting to do, it should be fairly straightforward (even if lengthy and daunting) to rewrite all the comments in your own language for you to understand. I suggest doing so in a bottom-up approach: it's easier to understand the small and trivial things a program does than to understand the top-level logic - and a lot of trivial things together make up the logic of the software.
Only once you understand the inner workings of the program - to a large degree, anyway - can you write its functional spec and work on features.
Non-free way on Windows:
You can use CppDepend. This application is able to parse your Visual Studio project or your source files. It gives you a lot of information, like dependency trees. You can try the trial (maybe it will be enough for what you have to do).
Free way multi-platform:
You can use doxygen with a special configuration (extract code structure from undocumented code) and analyze the result.
I was quite happy with a tool called Understand (15-day eval license available) for this kind of task. However, I agree with Guss that the time you'll need depends a lot on the size of the code, and one day is probably just enough for a small program.
cscope & ctags are a must when I work on my own code, and even more so when looking at other people's code.
You may also try this:
http://www.sgvsarc.com/product_crystalflow.htm
How much time would it take to write a C++ compiler using lex/yacc?
Where can I get started with it?
There are many constructs that cannot be parsed by a bison/yacc parser (for example, distinguishing between a declaration and a function call in some circumstances). Additionally, sometimes the interpretation of tokens requires input from the parser, particularly in C++0x. The handling of the character sequence >>, for example, is crucially dependent on parsing context.
Those two tools are very poor choices for parsing C++ and you would have to put in a lot of special cases that escaped the basic framework those tools rely on in order to correctly parse C++. It would take you a long time, and even then your parser would likely have weird bugs.
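A minimal sketch of the declaration-versus-call ambiguity mentioned above (hypothetical names):

    // Inside a function body, the statement "T(a);" is ambiguous to a
    // parser that sees only tokens:
    //   - if T names a type, it DECLARES a variable a of type T;
    //   - if T names a function, it is a CALL T(a).
    // Resolving this needs symbol-table feedback, which breaks the clean
    // lexer/parser split that yacc assumes.

    typedef int T;

    void f() {
        T(a);    // declares "int a;" because T names a type here
        a = 42;  // proves it: a is now a variable in scope
    }

    // The ">>" issue is similar: it can be a right-shift operator or two
    // closing template brackets, e.g. std::vector<std::vector<int>> m;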
yacc and bison are LALR(1) parser generators, which are not sophisticated enough to handle C++ effectively. As other people have pointed out, most C++ compilers now use a recursive descent parser, and several other answers have pointed at good solutions for writing your own.
C++ templates are no good for handling strings, even constant ones (though this may be fixed in C++0x, I haven't researched carefully), but if they were, you could pretty easily write a recursive descent parser in the C++ template language. I find that rather amusing.
It sounds like you're pretty new to parsing/compiler creation. If that's the case, I'd highly recommend not starting with C++. It's a monster of a language.
Either invent a trivial toy language of your own, or model something on a language that is much smaller and simpler. I saw a Lua parser where the grammar definition was about a page long. That'd be much more reasonable as a starting point.
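To put a number on "much more reasonable": a complete recursive descent parser for a toy expression language fits in a few dozen lines. A minimal sketch (everything here is invented for illustration):

    #include <cstdio>
    #include <string>
    #include <utility>

    // Toy grammar:  expr   -> term ('+' term)*
    //               term   -> factor ('*' factor)*
    //               factor -> DIGIT | '(' expr ')'
    // No whitespace or error handling, to keep the sketch short.
    class Parser {
    public:
        explicit Parser(std::string s) : src_(std::move(s)) {}
        int parse() { return expr(); }

    private:
        std::string src_;
        size_t pos_ = 0;

        char peek() const { return pos_ < src_.size() ? src_[pos_] : '\0'; }
        char next() { return src_[pos_++]; }

        int expr() {   // expr -> term ('+' term)*
            int v = term();
            while (peek() == '+') { next(); v += term(); }
            return v;
        }
        int term() {   // term -> factor ('*' factor)*
            int v = factor();
            while (peek() == '*') { next(); v *= factor(); }
            return v;
        }
        int factor() { // factor -> DIGIT | '(' expr ')'
            if (peek() == '(') { next(); int v = expr(); next(); return v; }
            return next() - '0';
        }
    };

    int main() {
        std::printf("%d\n", Parser("2+3*(4+1)").parse());  // prints 17
        return 0;
    }

Scaling that idea up to a C++-sized grammar is where the years go.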
It will probably take you years, and you'll probably switch to some other parser generator in the process.
Parsing C++ is notoriously error-prone. The grammar is not fully LR-parsable, as many parts are context-sensitive. You won't be able to get it working right in flex/yacc, or at least it'll be really awkward to implement. There are only two front-ends I know of that get it right. Your best bet is to use one of these and focus on writing the back-end. That's where the interesting stuff is anyway :-).
Existing C++ Front Ends:
The EDG front-end is used by most of the commercial vendors (Intel, Portland Group, etc.) in their compilers. It costs money, but it's very thorough. People pay big bucks for it because they don't want to deal with the pain of writing their own C++ parser.
GCC's C++ front-end is thorough enough for production code, but you'd have to figure out how to integrate this into your project. I believe it's fairly involved to separate it from GCC. This would also be GPL, but I'm not sure whether that's a problem for you. You can use the GCC front-end in your project via gcc_xml, but this will only give you XML for classes, functions, namespaces, and typedefs. It won't give you a syntax tree for the code.
Another possibility is to use clang, but their C++ support is currently spotty. It'll be nice to see them get all the bugs out, but if you look at their C++ status page you'll notice there are more than a few test cases that still break. Take heed -- clang is a big project. If it's taking these guys years to implement a C++ front-end, it's going to take you longer.
Others have mentioned ANTLR, and there is a C++ grammar available for it, but I'm skeptical. I haven't heard of an ANTLR front end being used in any major compilers, though I do believe it's used in the NetBeans IDE. It might be suitable for an IDE, but I'm skeptical that you'd be able to use it on production code.
A long time, and lex and yacc won't help
If you have the skills to write a compiler for such a large language, you will not need the small amount of help that lex and yacc give you. In fact, while lex is OK, it may take longer to use yacc, as it's not really quite powerful enough for C or C++, and you can end up spending far more time getting it to work right than it would take to just write a recursive descent parser.
I believe lex and yacc are best used for simple grammars, or when it is worth the extra effort to have a nicely readable grammar file, perhaps because the grammar is experimental and subject to change.
For that matter, the entire parser is possibly not the major part of your job, depending on exactly what goals you have for the code generator.
As others have already said, yacc is a poor choice for implementing a C++ parser. One can do it; the original GCC did so, before the GCC team got disgusted with how hard it was to maintain and extend. (Flex might be OK as a lexer.)
Some say recursive descent parsers are best, because Bjarne Stroustrup said so. Our experience is that GLR parsing is the right answer for this, and our GLR-based C++ front end is a nice proof, as is the Elsa front end. Our front end has been used in anger on millions of lines of C++ (including Microsoft and GCC dialects) to carry out program analyses and massive source code transformation.
But what is not emphasized enough is that parsing is just a very small portion of what it takes to build a compiler, especially for C++. You need to also build symbol tables ("what does this identifier mean in this context?") and to do that you need to encode essentially most of several hundred pages of the C++ standard. We believe that the foundation on which we build compiler-like tools, DMS, is extremely good for doing this, and it took us over a man-year to get just this part right.
But then you have the rest of the compiler to consider:
Preprocessor
AST construction
Semantic analysis and type checking
Control, Data flow, and pointer analysis
Basic code generation
Optimizations
Register allocation
Final Code Generation
Debugging support
I keep saying this: building a parser (the BNF part) for a language is like climbing the foothills of the Himalayas. Building a full compiler is like climbing Everest. Pretty much any clod can do the former (although C++ is right at the edge). Only the really serious do the latter, and only when extremely well prepared.
Expect building a C++ compiler to take you years.
(The SD C++ front end handles lexing, parsing, AST generation, symbol tables, some type checking, and regeneration of compilable source text from the AST, including the original comments, for the major C++ dialects. It has been developed over a period of some 6 years).
EDIT: May, 2015. The original answer was written in 2010; we now have 11 years invested, taking us up through C++14. The point is that it is an endless, big effort to build one of these.
Firstly, the "flex" tag on SO is about Adobe's product, not the lexer generator. Secondly, Bjarne Stroustrup is on record as saying he wished he had implemented Cfront (the first C++ compiler) using recursive descent rather than a table driven tool. And thirdly, to answer your question directly - lots. If you feel you need to write one, take a look at ANTLR - not my favourite tool, but there are already C++ parsers for it.
This is a non-trivial problem, and it would take quite a lot of time to do correctly. For one thing, the grammar for C++ is not completely parseable by an LALR parser such as yacc. You can do subsets of the language, but getting the entire language specification correct is tricky.
You're not the first person to think that this is fun. Here's a nice blog-style article on the topic:
Parsing C++
Here's an important quote from the article:
"After lots of investigation, I
decided that writing a
parser/analysis-tool for C++ is
sufficiently difficult that it's
beyond what I want to do as a hobby."
The problem with that article is that it's a bit old, and several of the links are broken. Here are some links to some other resources on the topic of writing C++ parsers:
ANTLR Grammars (contain several grammars for C++)
A YACC-able C++ 2.1 Grammar and the resulting ambiguities
Parsing and Processing C++ Code (Wikipedia)
Lex and yacc will not be enough. You need a linker and assembler too, and a C preprocessor.
It depends on how you do it.
How many pre-made components do you plan to use?
You need to get the description of the syntax and its tokens from somewhere.
For example, if you use LLVM, you can proceed faster. It already provides a lot of tools: assembler, linker, optimizer...
You can get a C preprocessor from the Boost project.
You need to create a test suite to test your compiler automatically.
It can take a year if you work on it each day, or much less if you have more talent and motivation.
Unless you have already written several other compilers, C++ is not a language you even want to start writing a compiler from scratch for; the language has a lot of places where the meaning requires a lot of context before the situation can be disambiguated.
Even if you have lots of experience writing compilers you are looking at several years for a team of developers. This is just to parse the code correctly into an intermediate format. Writing the backend to generate code is yet another specialized task (though you could steal the gcc backend).
If you do a Google search for "C++ grammars" there are a couple around to get you started.
C++ LEX Tokens: http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxLexer.l
C++ YACC Grammar: http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxGrammar.y
http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxTester.y
A few years - if you can get a research grant to rewrite lex/yacc :-)
People keep chasing their tails on this a lot - starting with Stroustrup, who always fancied being a language "designer" rather than an actual compiler writer (remember that his C++ was a mere codegen for ages, and would still be there if it wasn't for gcc and other folks).
The core issue is that real research on parser generators pretty much ceased to exist ever since CPUs became fast enough to handle functional languages and brute-force recursive descent. Recursive descent is the last resort when you don't know what to do - it does an exhaustive search till it nabs one "rule" that fires. Once you are content with that, you kind of lose interest in researching how to do it efficiently.
What you'd essentially need is a reasonable middle ground - like LALR(2) with fixed, limited backtracking (plus a static checker to yell if the "designer" splurges into a nondeterministic tree) and also limited and partitioned symbol-table feedback (modern parsers need to be concurrency-friendly).
Sounds like a research grant proposal, doesn't it :-) Now if we'd find someone to actually fund it, that would be something :-))
A C++ compiler is very complicated. To implement enough of C++ to be compatible with most C++ code out there would take several developers a couple of years full time. clang is a compiler project being funded by Apple to develop a new compiler for C, C++, and Objective-C, with several full-time developers, and the C++ support is still very far from being complete after a couple of years of development.
Recursive descent is a good choice for parsing C++. GCC and clang use it.
The Elsa parser (and my ellcc compiler) use the Elkhound GLR parser generator.
In either case, writing a C++ compiler is a BIG job.
Well, what do you mean by write a compiler?
I doubt any one guy has made a true C++ compiler that took it all the way down to assembly code, but I have used lex and yacc to make a C compiler, and I have also done it without them.
Using both, you can make a compiler that leaves out the semantics in a couple of days, but figuring out how to use them can easily take weeks or months. Figuring out how to make a compiler at all will take weeks or months no matter what, but the figure I remember is that once you knew how it all worked, it took a few days with lex and yacc and a few weeks without; the second approach had better results and fewer bugs, though, so it's really questionable whether they are worth using at all.
The 'semantics' is the actual code production. That can be very simple code that's just enough to work and might not take long at all, or you could spend your whole life doing optimization on it.
With C++ the big issue is templates, but there are so many little issues and rules that I can't imagine someone ever wanting to do this. Even if you DO finish, the problem is that you won't necessarily have binary compatibility, i.e. be recognized as a runnable program by a linker or the OS, because there's more to it than just C++, and its standard is hard to pin down; there are also yet more standards to worry about, which are even less widely available.
Let’s say that you decide to change the name of Stack Overflow to Frack Overflow.
Now, in your code you already have dozens of objects and variables and selectors with some variation of the name "Stack". You want them to now be replaced with "Frack".
So my question is, would you opt to run your entire codebase through a regular expression filter and change all of these names? Or would you let them be?
I would use the "rename" feature of a good IDE to do it for me.
It depends, really.
In a language like C++, you can get away with this because the compiler will let you know right away if something would break. However, other less-picky languages will allow you to refer to variables which don't exist, and the worst that happens is a slap on the wrist in the form of an exception being thrown for a null reference.
I was working on a Flex project once where the codebase was a real mess, and we decided to go through the code and beautify it a bit to meet the Adobe AS3 coding standards. Since I was new to the project, I didn't realize that the variable names in some classes actually referred to persistent objects which Hibernate (running in the Java webapp for the backend server) was using to create mappings. So renaming these variables caused the entire Flex frontend to misbehave, even when we did it with the "correct" refactoring tools in our IDE.
But really, I'd say to check your OCD at the door and make your changes a little at a time. Any time you change dozens of files in a large project, you risk destabilizing it, and in this case, the benefit derived from such a risk doesn't pay off.
I'd first ask myself the question why? It is a risk/reward judgement at the end of the day which only you can make.
I would be very reluctant to do it for stylistic reasons, but for class re-factoring it may be legitimate.
Well, not necessarily a disaster, but it certainly can cause some trouble on large code bases. That's why I hate Hungarian notation: it makes you change all of your variable names if you happen to change their types.
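A tiny illustration (hypothetical variable names):

    // Hungarian notation ties the name to the type:
    int iTotal = 0;       // the "i" prefix promises an int

    // If the type later changes, every use site must be renamed to keep
    // the prefix honest -- pure rename churn with no behavioral change:
    double dTotal = 0.0;  // the same logical variable after a type change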
If there are objects, members and fields in your solution with names that reference a certain customer implementation, I would work hard to re-factor these to use more generic names instead, and I would let Resharper do the re-naming, not some generic text-search-and-replace tool.
Just use a refactoring tool like Resharper by JetBrains or CodeRush and Refactor! by DevExpress. They change all references of a variable in your entire codebase automatically and can do much more.
I believe Refactor! is even included in the VB version of Visual Studio. I use Resharper and I refuse to develop without it.
If I were using a source code version control system (like SVN, Git, Bazaar, Mercurial, etc.) I would not be afraid to refactor my code.
Use some kind of "find replace all" or refactoring of some IDE, compile (if it is not a dynamic language) and run your tests (if any).
If something goes horribly wrong, you can always revert your code using the source control system.
Renaming is perhaps the most common refactoring. It is rather encouraged to refactor your code as you go, as this gives you the flexibility of not having to make permanent decisions about names, code placement, etc. as you are first writing your application. If you are not familiar with the idea, I would suggest you start with the Wikipedia page and then dive into Martin Fowler's site.
However, if you need to write your own regex to rename things, then imho you could use some better tools. Don't waste your time reinventing the wheel -- and then fixing whatever your new wheel broke by accident. If you have the option, use an existing tool (IDE or whatever) to do the dirty work.
Even if you have "dozens" of things to rename, I think you're better off finding them one by one manually, and then using an automatic Rename to fix all instances throughout your code.
You need good justification for doing it, I think. When I make changes that have a large number of potential side effects across a large codebase, which happens from time to time, I usually look for a way to make the compiler fail on spots I've missed. And, if possible, I tend to do it in stages so as to minimize the break.
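One hypothetical sketch of making the compiler fail on missed spots, for the Stack-to-Frack rename:

    // Rename StackWidget to FrackWidget, then poison the old name so any
    // call site the rename missed produces a compile-time diagnostic
    // instead of silently surviving a text search.
    class FrackWidget {
    public:
        void overflow() {}
    };

    // Temporary alias, kept only during the migration (C++14 attribute).
    // Deleting the old name outright turns the warnings into hard errors.
    using StackWidget [[deprecated("renamed to FrackWidget")]] = FrackWidget;

    int main() {
        FrackWidget w;   // updated site: compiles cleanly
        w.overflow();
        StackWidget w2;  // missed site: compiler emits deprecation warning
        w2.overflow();
        return 0;
    }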
I wouldn't rename just for the sake of renaming, though.