I have to refactor a medium-sized code base (< 200K LOC). The scope is moderate: rename some classes, move a few nested definitions up and down the class hierarchy, and remove unused code.
It would be straightforward to do by hand, but we will have to pick up bug fixes from the older code base for one or two years, and the project will change at least half of the lines in the existing code.
So I am planning to express the changes as a sequence: an indent pass (probably astyle), a sed script, and another indent pass.
My plan is to do the conversion by hand, then develop a sed script that yields the same result. The first part is clear enough, but developing a big sed script by hand is not particularly appealing, and I do not have a better idea.
Please, help.
Have a look at the large-scale static analysis and refactoring tools that the Mozilla developers were working on:
https://wiki.mozilla.org/Static_Analysis
I'm not sure what has happened since the release of gcc 4.5 - possibly pork and oink are easier to set up now.
sed can probably be cozened into doing it, but for multiline blocks you're better off with something easier to work with. Even awk would be an improvement, but I'd be looking at Perl/Python/scripting language of choice. Preferably with a parser, which would also save you the initial indent run.
In fact, I'd look for a parser that generated an annotated syntax tree, which makes refactoring largely a matter of moving tree branches around.
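As a rough illustration of the scripting-language route for the simplest part of the job (class renames), here is a minimal Python sketch. The `RENAMES` table and the class names in it are hypothetical. It assumes whole-word matching is good enough, which only holds if the old names are not reused for unrelated identifiers. It does not parse the code, so it also rewrites names inside comments and strings.

```python
import re

# Hypothetical rename table: old class name -> new class name.
RENAMES = {
    "OldWidget": "Widget",
    "OldWidgetFactory": "WidgetFactory",
}

# One alternation, longest names first, anchored on word boundaries so that
# "OldWidget" does not match inside "MyOldWidgetHelper".
_PATTERN = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, RENAMES), key=len, reverse=True))
    + r")\b"
)

def apply_renames(source):
    """Replace every whole-word occurrence of a key in RENAMES."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], source)
```

Because the table is data rather than a hand-written sed program, the same script can be re-run each time bug fixes are merged in from the older code base.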
Related
I really like automatic formatting, not only because it reduces time spent manually formatting code, but especially because it reduces time spent on pointless formatting discussions during code review. Unfortunately, there is often strong resistance to formatting existing code, particularly on projects where the style guide recommends trying to guess/match the bespoke style of every file you edit.
For most projects, agreeing on a style to use going forward and using git-clang-format is probably good enough. Unfortunately, git-clang-format does not do very well on code that uses a mix of spaces, width 8 hard tabs, and width 4 hard tabs.
I think nobody is fond of the mixed indentation, so if we could apply clang formatting in a way that doesn't show up in git diff -w or git blame -w it might be enough to overcome a lot of the objection to formatting existing code. We could then gradually converge on fully formatted code via git-clang-format on subsequent changes.
Worst case, I might be able to configure GNU indent or astyle to indent in a way that is consistent with the chosen .clang-format, but that seems like it would be more error-prone.
Is there a way to run clang-format on C++ code so that it only makes whitespace changes, without changing line wrapping? If not, what is the best way to achieve the same with other tools?
I have a large codebase to sort through, so I'd like to automate the process as much as possible. I've already managed to grep out all the lines that are relevant to my task, but I'd like to automate the task itself.
What I'm trying to do is change all trailing increments to leading increments, as in the following example:
i++;
becomes
++i;
I imagine that regexes will be involved, and I'm rather rusty with those. Also, which language would be best for scripting this? I'm working on Windows 7 x64 at the moment, but a platform-agnostic solution would be cool. Also, if you could point me to any specific resources for learning more about this type of problem, that would be awesome.
This would do it...
s/([a-z0-9]+)\+\+/++$1/g
But it's evil, and will break things like:
print "+++++++++Some heading++++++++++++\n";
Not to mention the cases where i++ and ++i actually do different things, and you need to use i++.
My suggestion (assuming you actually have some real reason to think that i++ is wrong in many cases):
rgrep ++ * > fixme.txt
Then open fixme.txt in your favorite text editor, and manually investigate each occurrence, removing it from the text file as you go.
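The breakage described in this answer is easy to reproduce. The following is a small Python rendering of the substitution, written for illustration only; the function name is made up and the sample inputs are shortened.

```python
import re

# Python rendering of the Perl-style substitution quoted in the answer.
# It has no notion of strings, comments, or expression context.
def naive_preincrement(text):
    return re.sub(r"([a-z0-9]+)\+\+", r"++\1", text)

# The intended case works.
print(naive_preincrement("i++;"))                               # ++i;
# A string literal gets rewritten too.
print(naive_preincrement(r'print "+++Some heading+++\n";'))     # print "+++Some ++heading+\n";
# So does an expression whose value depends on post-increment.
print(naive_preincrement("x = a[i++];"))                        # x = a[++i];
```

The last case is the dangerous one: the result still compiles, but it indexes a different element.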
First, see [13.15] Which is more efficient: i++ or ++i? For primitive types like int, there is no gain in efficiency. Assuming you know this, however, the solution depends on what tool you'll use. If you were using Perl, e.g.
perl -pi -w -e 's/search/replace/g;' *.cc
...then this regex would work:
s/\b([a-zA-Z]\w*)\+\+/\+\+$1/g
I agree with @Flimzy though that this seems extremely dangerous and should not be automated.
I've done this before with projects. It's a fine idea provided you have your code base in a repository. Conveniently, that allows you to run the change, then get a project diff dumped into your favorite text editor to see what all changed. A quick look will let you see if anything unexpected got edited, and you can fix it back.
Besides using the regex Flimzy suggested, the other option is to build a tiny lexer/parser to fit this use case. You aren't looking for a full blown compiler, just something that has sufficient states for comments and quotes.
Lex & Yacc, 2nd ed., has an example lexer for reading C code which is quite small, demonstrating the power of this tool chain. I believe that the examples are available elsewhere as well. If it seems too complex, you can fake it with a simple state machine and regex system built with a Perl script.
Finally, you can also check out PLY, if you'd rather use Python.
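To give an idea of how small such a "comments and quotes" lexer can be, here is a minimal Python sketch that uses a single regex alternation as the state machine. The function name is hypothetical. It only protects comments, string literals, and character literals; it does not handle raw strings, digit separators, or preprocessor oddities, and it does not check whether the post-increment value is actually used, so the output still needs review.

```python
import re

# Alternatives are tried left to right at each position, so comments and
# literals are consumed whole before the increment pattern can see inside them.
_TOKEN = re.compile(
    r"""
    (?P<skip>
        //[^\n]*                    # line comment
      | /\*.*?\*/                   # block comment
      | "(?:\\.|[^"\\\n])*"         # string literal
      | '(?:\\.|[^'\\\n])*'         # character literal
    )
  | \b(?P<name>[A-Za-z_]\w*)\+\+    # identifier followed by ++
    """,
    re.VERBOSE | re.DOTALL,
)

def preincrement_outside_literals(source):
    """Rewrite name++ as ++name, leaving comments and literals untouched."""
    def fix(match):
        if match.group("skip") is not None:
            return match.group(0)   # copy comments and literals verbatim
        return "++" + match.group("name")
    return _TOKEN.sub(fix, source)
```

Run it on a checked-in tree and read the resulting diff, as suggested above, rather than trusting the rewrite blindly.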
I have a large codebase written in HSP (see the Wikipedia article; think "BASIC", but Japanese).
By "large" I mean it has 151352 lines of code, 60 source files with total code size of 4.5 megabytes. Also, it has plenty of spaghetti code, no comments and badly needs refactoring. The good thing is that it has a lot of text messages, so not all of those lines represent actual program logic.
I'd like to convert this codebase to C++, while retaining my sanity. "I'd like" means that I'm not required to do it, but I'd strongly prefer to find a method to do it.
What's a good way to do it? Obviously, I can't just rewrite it all in C++ (well, I could in theory, but it would take up to 2 years, and I would introduce many bugs in the process), so I think a reasonable decision would be to implement a code recompiler/preprocessor that converts the source code into messy C++ (HSP is much simpler than C++, so it should be possible) and then start refactoring/documenting the result.
Unfortunately, I'm not entirely sure how to approach building the recompiler efficiently. While I know there are Lex/Yacc/Bison/Boost::spirit, I haven't used them personally.
So can you recommend a good way to perform such a conversion?
Any free tool ("free" as in "free beer") that is available on Windows is allowed, as long as it doesn't affect the license of the original source code.
Yacc is aimed at handling more complex tasks efficiently, and it is hard to learn; I think it's overkill here.
Spirit should be a better choice; if you already know it, go with it. Personally, I would use Prolog for this task.
Prolog has built-in support for syntax analysis in the form of DCGs (definite clause grammars). For a language as simple as BASIC, I'm pretty sure the grammar poses no practical problems, and modern Prologs (I'm thinking of SWI-Prolog in particular) handle complex character encodings in the source very well.
Also, in Prolog you could try some naive pattern-based rewriting to untangle the spaghetti code. Doing that in general is a complex task, but it could be easy if you have just a small number of patterns, repeated many times.
Pattern matching is key in such problems...
Well, if you really want to go this way and ignore the advice in the comments, you should probably have a good look at the OpenHSP compiler, and mostly the codegen file:
http://dev.onionsoft.net/trac/browser/trunk/hspcmp/codegen.cpp
and also keep the token definitions in front of you:
http://dev.onionsoft.net/trac/browser/trunk/hspcmp/token.h
http://dev.onionsoft.net/trac/browser/trunk/hspcmp/token.cpp
It seems that HSP is not that complicated, and you can skip the AST step, though you could get good optimizations out of one. Also, don't forget to prepare a C++ library to embed your generated code in, so you can manage HSP oddities (like globals and dynamic typing).
If you can hack something out of that, you'll also have to remove most of what this compiler does (creating the executable, linking, and so on). Don't forget, it's a really long and hard task that may not be faster or easier than a full rewrite. But if you're ready, you'll find out the hard way :)
According to the original owner of the codebase, HSP version 3 and later includes an HSP-to-C code converter. I have not verified this for lack of time, but this blog article documents a tool called hspcnv that is supposed to convert HSP code into C code. The article is in Japanese.
I am using Emacs + SLIME for Clojure development. Recently we got a new team member who does not like Emacs, so he installed IntelliJ with the La Clojure plugin.
Both Emacs and IntelliJ can automatically re-indent big blocks of code, entire functions, and even modules.
This leads to a very annoying problem. If he makes a small change (a few lines) and then re-indents the entire file, recording it into the DVCS (we use darcs) produces a big patch with hundreds of changed lines. That makes code review impossible. How do I know which 3 of the hundreds of committed lines really changed?
So now we have a collaboration problem. I wonder if there are other Clojure teams who use different IDEs. How do you reconcile these problems?
The options I see are:
1. Enforce the use of one IDE (Emacs). This will solve the problem, but I do not like such an authoritarian approach.
2. Somehow set up both environments to indent identically (not sure if it's possible).
3. Agree to always indent in one IDE. This is cumbersome and error-prone.
You could try asking him to respect the rest of the team and not automatically reformat everybody's code.
You might also write an external program to do the indenting, and then hook that into your source code system as a pre-commit step. That way it wouldn't matter what editor was used, it would become consistent. I suppose this is a variation on option #2.
Good luck.
If someone chooses to use a different IDE, it's their responsibility to configure it to be a good member of the team. If they don't know how to do so, they probably don't have much of a reason to use it over any other IDE.
That said, in IntelliJ the configuration options for indentation are located in: Preferences > Code Style
Number 2): One team, one coding standard.
If you can generate a diff from the patch that ignores whitespace changes then your specific problem would go away. +1 for a consistent coding style within the team though.
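If the tool at hand has no such option, the idea can be approximated with a short script. The following is a minimal Python sketch, not a replacement for a proper diff: the function names are hypothetical, it compares lines with all whitespace removed (so whitespace changes inside string literals are ignored too), and it assumes the re-indent did not join or split lines.

```python
import difflib
from typing import List

def _squash(line: str) -> str:
    """Drop all whitespace so that re-indented lines compare equal."""
    return "".join(line.split())

def changed_lines(old: str, new: str) -> List[str]:
    """Return the lines of `new` that differ from `old` beyond whitespace."""
    old_keys = [_squash(line) for line in old.splitlines()]
    new_lines = new.splitlines()
    new_keys = [_squash(line) for line in new_lines]
    matcher = difflib.SequenceMatcher(a=old_keys, b=new_keys, autojunk=False)
    result = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            result.extend(new_lines[j1:j2])
    return result
```

Given the old and new contents of a re-indented file, this lists only the lines whose non-whitespace content changed, which is what a reviewer needs to see.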
Let’s say that you decide to change the name of Stack Overflow to Frack Overflow.
Now, in your code you already have dozens of objects and variables and selectors with some variation of the name "Stack". You want them to now be replaced with "Frack".
So my question is, would you opt to run your entire codebase through a regular expression filter and change all of these names? Or would you let them be?
I would use the "rename" feature of a good IDE to do it for me.
It depends, really.
In a language like C++, you can get away with this because the compiler will let you know right away if something would break. However, other less-picky languages will allow you to refer to variables which don't exist, and the worst that happens is a slap on the wrist in the form of an exception being thrown for a null reference.
I was working on a Flex project once where the codebase was a real mess, and we decided to go through the code and beautify it a bit to meet the Adobe AS3 coding standards. Since I was new to the project, I didn't realize that the variable names in some classes actually referred to persistent objects which Hibernate (running in the Java webapp for the backend server) was using to create mappings. So renaming these variables caused the entire Flex frontend to misbehave, even when we did it with the "correct" refactoring tools in our IDE.
But really, I'd say to check your OCD at the door and make your changes a little at a time. Any time you change dozens of files in a large project, you risk destabilizing it, and in this case, the benefit derived from such a risk doesn't pay off.
I'd first ask myself the question: why? It is a risk/reward judgement at the end of the day, which only you can make.
I would be very reluctant to do it for stylistic reasons, but for class re-factoring it may be legitimate.
Well, not necessarily a disaster, but it certainly can cause some trouble on large code bases. That's why I hate Hungarian notation: it makes you rename a variable everywhere whenever you happen to change its type.
If there are objects, members and fields in your solution with names that reference a certain customer implementation, I would work hard to refactor these to use more generic names instead, and I would let ReSharper do the renaming, not some generic text-search-and-replace tool.
Just use a refactoring tool like ReSharper by JetBrains or CodeRush and Refactor! by DevExpress. They change all references to a variable in your entire codebase automatically and can do much more.
I believe Refactor! is even included in the VB version of Visual Studio. I use ReSharper and I refuse to develop without it.
If I were using a source code version control system (like svn, git, Bazaar, Mercurial, etc.), I would not be afraid to refactor my code.
Use some kind of "find and replace all" or the refactoring feature of an IDE, compile (if it is not a dynamic language), and run your tests (if any).
If something goes horribly wrong, you can always revert your code using the source control system.
Renaming is perhaps the most common refactoring. It is rather encouraged to refactor your code as you go, as this gives you the flexibility of not having to make permanent decisions about names, code placement, etc. as you are first writing your application. If you are not familiar with the idea, I would suggest you start with the Wikipedia page and then dive into Martin Fowler's site.
However, if you need to write your own regex to rename things, then imho you could use some better tools. Don't waste your time reinventing the wheel -- and then fixing whatever your new wheel broke by accident. If you have the option, use an existing tool (IDE or whatever) to do the dirty work.
Even if you have "dozens" of things to rename, I think you're better off finding them one by one manually, and then using an automatic Rename to fix all instances throughout your code.
You need good justification for doing it, I think. When I make changes that have a large number of potential side effects across a large codebase, which happens from time to time, I usually look for a way to make the compiler fail on spots I've missed. And, if possible, I tend to do it in stages so as to minimize the break.
I wouldn't rename just for the sake of renaming, though.