I really like automatic formatting, not only because it reduces time spent manually formatting code, but especially because it reduces time spent on pointless formatting discussions during code review. Unfortunately, there is often strong resistance to formatting existing code, particularly on projects where the style guide recommends trying to guess/match the bespoke style of every file you edit.
For most projects, agreeing on a style to use going forward and using git-clang-format is probably good enough. Unfortunately, git-clang-format does not do very well on code that uses a mix of spaces, width 8 hard tabs, and width 4 hard tabs.
I think nobody is fond of the mixed indentation, so if we could apply clang formatting in a way that doesn't show up in git diff -w or git blame -w it might be enough to overcome a lot of the objection to formatting existing code. We could then gradually converge on fully formatted code via git-clang-format on subsequent changes.
Worst case, I might be able to configure GNU indent or astyle to indent in a way that is consistent with the chosen .clang-format, but that seems like it would be more error-prone.
Is there a way to run clang-format on C++ code so that it only makes whitespace changes, without changing line wrapping? If not, what is the best way to achieve the same with other tools?
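As a rough illustration of what a "whitespace only, no re-wrapping" pass could look like, here is a minimal Python sketch. It is not clang-format and does not read a .clang-format file; it only rewrites the leading run of spaces and tabs on each line as spaces, so the result should be invisible to git diff -w. The function names and the tab_width parameter are made up for this example, and the right tab width (8 or 4) would have to be chosen per file.

```python
def expand_leading_whitespace(line: str, tab_width: int = 8) -> str:
    """Rewrite only the leading run of spaces/tabs as spaces."""
    body = line.lstrip(" \t")
    leading = line[: len(line) - len(body)]
    # expandtabs() respects tab stops, so " \t" becomes one full tab stop.
    return leading.expandtabs(tab_width) + body


def normalize_indentation(text: str, tab_width: int = 8) -> str:
    """Apply the leading-whitespace rewrite to every line, keeping line breaks."""
    lines = text.splitlines(keepends=True)
    return "".join(expand_leading_whitespace(line, tab_width) for line in lines)


if __name__ == "__main__":
    # Hypothetical input mixing hard tabs and spaces in the indentation.
    sample = "void f() {\n\tif (x)\n  \t    g();\t// call\n}\n"
    print(normalize_indentation(sample, tab_width=8), end="")
```

Because only leading whitespace is touched, tabs inside a line (for example before a trailing comment) are left alone. One caveat: continuation lines of multi-line string literals would also be rewritten, so those need checking by hand.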
I have a large codebase written in HSP (Wikipedia article; think "BASIC", but Japanese).
By "large" I mean it has 151352 lines of code, 60 source files with total code size of 4.5 megabytes. Also, it has plenty of spaghetti code, no comments and badly needs refactoring. The good thing is that it has a lot of text messages, so not all of those lines represent actual program logic.
I'd like to convert this codebase to C++, while retaining my sanity. "I'd like" means that I'm not required to do it, but I'd strongly prefer to find a method to do it.
What's a good way to do it? Obviously, I can't just rewrite it all in C++ (well, I could in theory, but it would take up to 2 years, and I would introduce many bugs in the process), so I think a reasonable decision would be to implement a code recompiler/preprocessor that converts the source code into messy C++ (HSP is much simpler than C++, so it should be possible) and then start refactoring/documenting the result.
Unfortunately, I'm not entirely sure how to approach building the recompiler efficiently. While I know there are Lex/Yacc/Bison/Boost::spirit, I haven't used them personally.
So can you recommend a good way to perform such a conversion?
Any free tool ("free" as in "free beer") that is available on the Windows platform is allowed, as long as it doesn't affect the license of the original source code.
Yacc is targeted at efficiently handling more complex tasks, and it's complex to learn; I think it's overkill.
Spirit should be a better choice; if you already know it, go with it. Personally, I would use Prolog for this task.
Prolog has built-in syntax analysis, the so-called DCG (definite clause grammar). For a language as simple as Basic, I'm pretty sure there are no practical problems in the grammar, and modern Prologs (I'm thinking of SWI-Prolog in particular) can handle complex character encodings in the source very well.
Also, in Prolog you could try some naive approaches to unrolling the spaghetti code. Doing this in general is a complex task, but it could be easy if you have just a small number of patterns, repeated many times.
Pattern matching is key in such problems...
Well, if you really want to go this way and ignore the advice in the comments, you should probably have a good look at the openhsp compiler, and mostly the codegen file:
http://dev.onionsoft.net/trac/browser/trunk/hspcmp/codegen.cpp
and also keep the token definitions in front of you:
http://dev.onionsoft.net/trac/browser/trunk/hspcmp/token.h
http://dev.onionsoft.net/trac/browser/trunk/hspcmp/token.cpp
It seems that HSP is not that complicated, and you can skip the AST step, though you could get good optimizations out of one. Don't forget also to prepare a C++ lib to embed your generated code in, so you can manage HSP oddities (like globals and dynamic typing).
If you can hack something out of that, you'll also have to remove most of what this compiler does (creating the executable, linkage and so on). Don't forget, it's a really long and hard task that may not be faster or easier than a full rewrite. But if you're ready, you'll find it out the hard way :)
According to the original owner of the codebase, HSP starting with version 3 includes an HSP-to-C code converter. I have not verified this due to lack of time, but this blog article documents a tool called hspcnv which is supposed to convert HSP code into C code. The article is in Japanese.
I am using Emacs + SLIME for Clojure development. Recently we got a new team member who does not like Emacs, so he installed IntelliJ with the La Clojure plugin.
Both Emacs and IntelliJ allow you to automatically re-indent big blocks of code, entire functions and even modules.
This leads to a very annoying problem. If he makes a small change (a few lines) and then reindents the entire file, then recording it into the DVCS (we use darcs) will obviously produce a big patch with hundreds of lines changed. That makes code review impossible. How do I know which 3 out of hundreds of committed lines really changed?
So now we have a collaboration problem. I wonder if there are other Clojure teams who use different IDEs. How do you reconcile these problems?
The options I see are:
Enforce the use of one IDE (Emacs). This will solve the problem, but I do not like such an authoritarian approach.
Somehow set up both environments to indent identically (not sure if it's possible).
Agree to always indent in one IDE. This is cumbersome and prone to errors.
You could try asking him to respect the rest of the team and not automatically reformat everybody's code.
You might also write an external program to do the indenting, and then hook that into your source code system as a pre-commit step. That way it wouldn't matter what editor was used, it would become consistent. I suppose this is a variation on option #2.
Good luck.
If someone chooses to use a different IDE, it's their responsibility to configure it to be a good member of the team. If they don't know how to do so, they probably don't have much of a reason to use it over any other IDE.
However, the configuration options for indentation are located in: Preferences > Code Style
Number 2): One team, one coding standard.
If you can generate a diff from the patch that ignores whitespace changes then your specific problem would go away. +1 for a consistent coding style within the team though.
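To make the "diff that ignores whitespace" idea concrete, here is a small Python sketch using the standard difflib module. It compares two versions of a file after squashing all whitespace out of each line, and reports only the lines whose non-whitespace content differs. The function names are invented for this example, and squashing every space is a deliberately crude assumption: a change inside a string literal that only adds or removes spaces would be missed.

```python
import difflib


def _squash(line: str) -> str:
    """Drop all whitespace so that re-indented lines compare equal."""
    return "".join(line.split())


def real_changes(old: str, new: str) -> list[str]:
    """Return '-'/'+' lines for changes that are more than whitespace."""
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    matcher = difflib.SequenceMatcher(
        a=[_squash(line) for line in old_lines],
        b=[_squash(line) for line in new_lines],
        autojunk=False,
    )
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        changes += ["-" + line for line in old_lines[i1:i2]]
        changes += ["+" + line for line in new_lines[j1:j2]]
    return changes


if __name__ == "__main__":
    before = "(defn f [x]\n  (+ x 1))\n(defn g [y]\n  y)\n"
    after = "(defn f [x]\n      (+ x 1))\n(defn g [y]\n    (* y 2))\n"
    for line in real_changes(before, after):
        print(line)
```

In the sample, the re-indented body of f is treated as unchanged, and only the edited body of g is reported.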
I have to refactor a medium-sized code base (< 200K LOC). The scope is pretty moderate: rename some classes, move a few nested definitions up and down the class hierarchy, remove unused stuff.
It would be pretty straightforward to do it by hand, but we will have to pick up bug fixes from the older code base for one or two years, and the project will change at least half of the lines in the existing code.
So, I am planning to express the changes as a sequence of an indent pass (probably astyle), a sed script, and another indent pass.
My plan is to do the conversion by hand, then develop a sed script that will yield the same result. The first part is pretty clear, but developing a big sed script by hand does not seem particularly appealing, and I do not have any better idea.
Please, help.
Have a look at the large-scale static analysis and refactoring tools that Mozilla devs were working on:
https://wiki.mozilla.org/Static_Analysis
I'm not sure what has happened since the release of gcc 4.5 - possibly pork and oink are easier to set up now.
sed can probably be cozened into doing it, but for multiline blocks you're better off with something easier to work with. Even awk would be an improvement, but I'd be looking at Perl/Python/scripting language of choice. Preferably with a parser, which would also save you the initial indent run.
In fact, I'd look for a parser that generated an annotated syntax tree, which makes refactoring largely a matter of moving tree branches around.
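For the simplest part of the job (renaming classes), a scripted pass might look like the Python sketch below. It is only a stand-in for the parser-based approach recommended above: it matches whole identifiers with regular-expression word boundaries, so it knows nothing about scopes and will also rename matches inside comments and string literals. The function name and the rename table are hypothetical.

```python
import re


def rename_identifiers(source: str, renames: dict[str, str]) -> str:
    """Replace whole-word occurrences of each key with its new name."""
    if not renames:
        return source
    # Longest names first, so overlapping alternatives are tried predictably.
    names = sorted(renames, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda match: renames[match.group(1)], source)


if __name__ == "__main__":
    code = "class Foo : public FooBase { Foo(); };"
    print(rename_identifiers(code, {"Foo": "Widget"}))
```

Keeping the rename table in a script like this makes the transformation repeatable each time bug fixes are merged from the older code base, which is the point of scripting it rather than editing by hand.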
If you've got a codebase which is a bit messy in respect to coding standards - a mix of different conventions from different people - is it reasonable to give one person the task of going through every file and bringing it up to meet standards?
As well as being tremendously dull, you're going to get a mass of changes in SVN (or whatever) which can make comparing versions harder. Is it sensible to set someone on the whole codebase, or is it considered stupid to touch a file only to make it meet standards? Should files be left alone until some 'real' change is needed, and then updated?
Tagged as C++ since I think different languages have different automated tools for this.
Should files be left alone until some 'real' change is needed, and then updated?
This is what I would do.
Even if it's primarily text layout changes, doing it by a manual process on a large scale risks breaking code that was working.
Treat it as a refactor and do it locally whenever code has to be touched for some other reason. Add tests if they're missing to improve your chances of not breaking the code.
If your code is already well covered by tests, you might get away with something global, but I still wouldn't advocate it.
I also think this is pretty much language-agnostic.
It also depends on what kind of changes you are planning to make in order to bring it up to your coding standard. Everyone's definition of coding standard is different.
More specifically:
Can your proposed changes be made to the project with a 100% guarantee that the entire project will work exactly the same as before? For example, changes that only affect comments, line breaks and whitespace should be fine.
If you do not have a 100% guarantee, then there is a risk that should not be taken unless it can be balanced with a benefit. For example, is there a need to gain a deeper understanding of the current code base in order to continue its development, or fix its bugs? Is the jumble of coding conventions preventing these initiatives? If so, evaluate the costs and benefits and decide whether a makeover is justified.
If you need to understand the current code base, here is a technique: tracing.
Make a copy of the code base. Note that tracing involves adding code, so it should not be performed on the production copy.
In the new copy, insert many fprintf (trace) statements into any functions considered critical. It may be possible to automate this.
Run the project with various inputs and collect those tracing results. This will help everyone understand the current project's design.
Another technique for understanding the current code base is to document the dependencies in the project.
Some kinds of dependencies (interface dependency, C++ include dependency, C++ typedef / identifier dependency) can be extracted by automated tools.
Run-time dependency can only be extracted through tracing, or by profiling tools.
I was thinking it's a task you might give a work-experience kid or put out onto RentaCoder
This depends mainly on the codebase's size.
I've seen three trainees given the task to go through a 2MLoC codebase (several thousand source files) in order to insert one new line into the standard disclaimer at the top of all the source files (with the line's content depending on the file's name and path). It took them several days. One of the three used most of that time to write a script that would do it and later only fixed the files where the script had failed to insert the line correctly, the other two just ploughed through the files. (The one who wrote the script later got a job at that company.)
The job of manually adapting all those files in that codebase to certain coding standards would probably have to be measured in man-years.
OTOH, if it's just a few dozen files, it's certainly doable.
Your codebase is very likely somewhere in between, so your best bet might be to set a "work-experience kid" to find out whether there's a tool that can do this to your satisfaction and, if so, make it work.
Should files be left alone until some 'real' change is needed, and then updated?
I'd strongly advise against this. If you do this, you will have "real" changes intermingled with whatever reformatting took place, making it nigh impossible to see the "real" changes in the diff.
You can address the formatting aspect of coding style fairly easily. There are a number of tools that can auto-format your code. I recommend hooking one of these up to your version control tool's "check in" feature. This way, people can use whatever format they want while editing their code, but when it gets checked in, it's reformatted to the official style.
In general, I think it's best if you can do the big change all at once. In the past, we've done the following:
1. have a time dedicated to the reformatting when most people aren't working (e.g. at night or on the weekend)
2. have a person check out as many files as possible at that time, reformat them, and check them in again
With a reformatting-only revision, you don't have to figure out what has changed in addition to the formatting.
I've been helping augment a twenty-some year old proprietary language within my company. It is a large, Turing-complete language. Translating it to another grammar regime (such as Antlr) is not an option (I don't get to decide this).
For the most part, extending the grammar has gone smoothly. But every once in a while I'll get a reduce-reduce or shift-reduce conflict that
is difficult to eliminate
sometimes just doesn't make sense (to my feeble brain)
After a lot of painful staring at y.output files and experimental grammar refactorings, I've usually gotten where I wanted to go. Sometimes I've had to make unsatisfactory compromises.
So, are there any tools out there which can suck in a yacc grammar and make it easier to browse it, experiment with it, and debug changes?
If I add a production, I'd like to see more than "atomic production that is used everywhere" (think identifier) "conflicts with rule foo" (yes, there is more info, s/r, r/r, than that, but I think you get my drift). It would be nice to have some hint of the interplay beyond putting on my thinking cap and trying to imagine a symbol stack and state machine.
Update: I guess I should clarify. We use Berkeley Yacc. I have been testing using a recent version of Bison. For output, I've compiled the grammar with --report=itemset.
My goal with this post is to seek out external tools which augment the grammar debugging facilities which ship with yacc. It's painful today with the default set. Help me find better interactive tools, such as those you can use with Antlr.
You might get some help from yacc -d, which produces debugging output -- it basically gives a full listing of the symbol stack states and such. The output is dense and voluminous, so trying to read all of it directly rarely accomplishes much (never has for me anyway). However, when you make a change that gives (for example) an r/r conflict, you can run yacc -d on the old grammar and the new one, then run diff on the results, to get a much more detailed run-down on what change(s) caused the conflict.
It's probably worth noting, however, that s/r conflicts are often benign -- unless you're fairly sure it's a problem, trying to "fix" it often isn't worthwhile. The same is not true with r/r conflicts though. While these are sometimes benign, it's comparatively rare.
Edit: Oops -- sorry, that should be -v. You mention y.output, so you apparently already know how to do that part. The point is that you don't try to look at the y.output files directly, but do a diff between the one that came out cleanly and the one that didn't, to get some detail about the actual conflict (without staring at 10 jillion lines of "stuff" that's just fine).
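The comparison step can be done with plain diff, or with a few lines of Python if you want to post-process the result. The sketch below uses the standard difflib module; the function names and the file labels are made up for illustration, and it makes no assumption about the exact layout of the report beyond it being line-oriented text.

```python
import difflib


def report_diff(old_report: str, new_report: str, context: int = 2) -> list[str]:
    """Unified diff between the report of the clean grammar and the new one."""
    return list(
        difflib.unified_diff(
            old_report.splitlines(),
            new_report.splitlines(),
            fromfile="y.output.old",  # labels only; nothing is read from disk
            tofile="y.output.new",
            n=context,
            lineterm="",
        )
    )


def added_lines(diff: list[str]) -> list[str]:
    """Lines that appear only in the new report, without the '+' marker."""
    return [d[1:] for d in diff if d.startswith("+") and not d.startswith("+++")]


if __name__ == "__main__":
    import sys
    from pathlib import Path

    # Usage (hypothetical paths): python report_diff.py old/y.output new/y.output
    old_text = Path(sys.argv[1]).read_text()
    new_text = Path(sys.argv[2]).read_text()
    print("\n".join(report_diff(old_text, new_text)))
```

Filtering with added_lines gives a short list of what the new production introduced, which is usually a much smaller thing to stare at than either report on its own.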
This is the best I got:
http://tldp.org/HOWTO/Lex-YACC-HOWTO-7.html