Automated source code manipulation with regular expressions

I would like to be able to automatically process a file with a regular expression and perform a more or less arbitrary action on the match contents. For my most recent need I would like to be able to find every instance of Grid.Row="some int" in a xaml file and increment that row number by one whenever it is larger than X. Yes, for this particular example even though this is legacy code the better approach would be to restructure so this same problem does not need a hack solution the next time around. However, I have encountered the need to do this sort of thing more than once, so I'll ask anyway.
Do any of you know of tools that already exist that would let me do something like this before I go write something simple myself? I googled around for a bit but didn't see anything besides basic regex tools.
Thanks.

I gather from the lack of feedback that nothing like this exists. I built a quick JavaScript app to fit my needs. If I have time to make it flexible enough I'll update this answer with a GitHub link. I've started using it to do things like making every word start with a lowercase letter, splitting camelCase into words for documentation, etc. I'm really surprised nothing official exists yet.
Thanks anyway.
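For the Grid.Row case specifically, most regex libraries accept a function as the replacement, which covers the "arbitrary action on the match contents" part. Here is a minimal Python sketch, assuming the attribute is always written exactly as Grid.Row="N" with a non-negative integer. The names bump_grid_rows and threshold are made up for this example:

```python
import re

# Assumes the attribute always appears as Grid.Row="<digits>".
GRID_ROW = re.compile(r'Grid\.Row="(\d+)"')

def bump_grid_rows(xaml: str, threshold: int) -> str:
    """Increment every Grid.Row value that is larger than threshold."""
    def bump(match):
        row = int(match.group(1))
        if row > threshold:
            row += 1
        return f'Grid.Row="{row}"'
    # re.sub calls bump() once per match and uses its return value.
    return GRID_ROW.sub(bump, xaml)

if __name__ == "__main__":
    sample = '<Button Grid.Row="2"/><Label Grid.Row="5"/>'
    print(bump_grid_rows(sample, 3))
```

Because the callback is ordinary code, the same shape works for other per-match actions, such as changing case or splitting camelCase.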

Related

Calling OpenOffice spell/grammar check from a C/C++ program

The problem is as follows: I'm writing a brute-force decrypter to crack some supersecret code (it's a contest, not a crime), which turned out to be impossible: just too many nodes in the tree that needs to be searched. To overcome this problem, I thought it might be helpful to check the intermediate 'solutions' to see if they produce (parts of) sentences. For example, I might get something like: "jvabaosajbgasgav..." or "lookslikeitsworking....". The first clearly is gibberish and in that case it wouldn't make any sense to continue cracking the code. The second one can easily be identified by eye as a valid English sentence.
I'm not planning on writing my own spell/grammar checker, so I thought it might be possible to call the spell checker from an open source project like OpenOffice or LibreOffice. I checked the openoffice.org website but I couldn't really find out what to do next. Like, how can I link against their libraries? Are these libraries in the SDK? What functions can I use?
The program I'm writing is in pure C, so I probably need to write a wrapper to call their C++ member-functions, right?
Any help is much appreciated!

I believe you'd be vastly more successful integrating with something written with such integration in mind, like the Aspell library.

Move a pattern from after a string to before

I have a large codebase to sort through, so I'd like to automate the process as much as possible. I've already managed to grep out all the lines that are relevant to my task, but I'd like to automate the task itself.
What I'm trying to do is change all trailing increments to leading increments, as in the following example:
i++;
becomes
++i;
I imagine that regexes will be involved, and I'm rather rusty with those. Also, which language would be best for scripting this? I'm working on Windows 7 x64 at the moment, but a platform-agnostic solution would be cool. Also, if you could point me to any specific resources for learning more about this type of problem, that would be awesome.

This would do it...
s/([a-z0-9]+)\+\+/++$1/g
But it's evil, and will break things like:
print "+++++++++Some heading++++++++++++\n";
Not to mention the cases where i++ and ++i actually do different things, and you need to use i++.
My suggestion (assuming you actually have some real reason to think that i++ is wrong in many cases):
rgrep ++ * > fixme.txt
Then open fixme.txt in your favorite text editor, and manually investigate each occurrence, removing it from the text file as you go.

First, see [13.15] Which is more efficient: i++ or ++i? For primitive types like int, there is no gain in efficiency. Assuming you know this, however, the solution depends on what tool you'll use. If you were using Perl, e.g.
perl -pi -w -e 's/search/replace/g;' *.cc
...then this regex would work:
s/\b([a-zA-Z]\w*)\+\+/\+\+$1/g
I agree with @Flimzy though that this seems extremely dangerous and should not be automated.

I've done this before with projects. It's a fine idea provided you have your code base in a repository. Conveniently, that allows you to run the change, then get a project diff dumped into your favorite text editor to see what all changed. A quick look will let you see if anything unexpected got edited, and you can fix it back.
Besides using the regex Flimzy suggested, the other option is to build a tiny lexer/parser to fit this use case. You aren't looking for a full blown compiler, just something that has sufficient states for comments and quotes.
Lex & Yacc, 2nd Ed. has an example lexer for reading C code which is quite small, demonstrating the power of this tool chain. I believe that the examples are available elsewhere as well. If it seems too complex, you can fake it with a simple state machine and regex system built with a Perl script.
Finally, you can also check out ply, if you'd rather use python.
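To illustrate the "just enough states for comments and quotes" idea, here is a rough Python sketch. It is not a C++ parser. It skips string/char literals and comments, and it only rewrites name++ when that is the whole statement, meaning the previous significant character is ;, { or } and the next one is ;. Everything else (x = i++;, *p++;, for-loop headers, obj.n++;) is deliberately left alone. The function name and the exact rules are assumptions of this sketch:

```python
import re

# One alternative per "state" the tiny lexer cares about.
TOKEN = re.compile(r'''
    (?P<string>"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*')
  | (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<postinc>\b(?P<name>[A-Za-z_]\w*)\+\+(?=\s*;))
  | (?P<space>\s+)
  | (?P<other>.)
''', re.VERBOSE | re.DOTALL)

def post_to_pre_increment(source: str) -> str:
    out = []
    prev = ';'  # start of file behaves like a statement boundary
    for m in TOKEN.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind == 'postinc' and prev in ';{}':
            # Standalone statement: the value is unused, so swapping is safe.
            text = '++' + m.group('name')
        if kind not in ('space', 'comment'):
            prev = text[-1]  # remember the last significant character
        out.append(text)
    return ''.join(out)

if __name__ == "__main__":
    print(post_to_pre_increment('i++;\nx = j++;\nputs("k++;"); // n++;\n'))
```

It errs on the side of missing cases (for example, the body of a brace-less if or for), so you would still want to review the diff as suggested above.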

Any good C++ refactoring tools that could handle this scenario

I have a large C++ code base that contains a couple of functions for error logging that I'm planning to rewrite, defined as follows;
void LogError(ErrorLevel elvl,LPCTSTR Format,...); // Literal version
void LogError(ErrorLevel elvl,UINT ResourceID,...); // Resource version
I'm planning to rewrite these as a single function
void LogError(ErrNo No,...);
ErrNo in this case will be an enum, used to look up the rest of the error details from an external file. While I'm using and love Visual Assist, it doesn't appear to be up to this kind of thing. I'm thinking the easiest way to carry out this refactor is to write a small program that uses the results of a search output to find all the occurrences of this function, e.g.
c:\cpp\common\Atlas\Comps\LSADJUST.cpp
LSAFormNormalEquations (174): LogError(elvl_Error,IDS_WINWRN0058,i+1,TravObs.setup_no,TravObs.round_no
LSAFormNormalEquations (180): LogError(elvl_Error,IDS_WINWRN0059,i+1,TravObs.setup_no,TravObs.round_no
LSAFormNormalEquations (186): LogError(elvl_Error,IDS_WINWRN0060,i+1,TravObs.setup_no,TravObs.round_no
c:\cpp\common\Atlas\Comps\LSADJUSTZ.CPP
LSAFormNormalEquationsZ (45): LogError(elvl_Note,_T("Adjusting heights by least squares"));
c:\cpp\Win32\Atlas\Section\OptmizeSectionVolumes.cpp
OnSectionOptimizeVolumes (239): LogError(elvl_Note,"Shifted section at chainage %0.1lf by %0.3lf",Graph.c1,Offset);
and then parse and modify the source. Are there any other tools that could simplify this task for me? I've looked at a related question which suggests there isn't much out there. I don't mind spending a small amount for a reasonably easy to use tool, but don't have the time or budget for anything more than this.

If you were using Unix, using sed to edit all your source-code might handle most of the changes. You would have to complete some of the changes by hand. I have used this technique in the past.

Searching around for something lightweight that met my needs drew a blank, and learning sed, while worthwhile, would have been a fair amount of work for something that did not quite solve my problem. I ended up writing my own tool to carry out the refactor on a separate copy of the code base until I was happy that it was doing exactly what I needed. This involved taking the output from Visual Assist's Find All References option and using it to refactor the code base. I'd post the code, but as it stands it is pretty awful and would be liable to fail on a different code base. The general problem can be better stated as something like this:
For a C++ code base, find every occurrence of function fn taking parameters a,b,...n
Remove the occurrence of fn from the source file
Extract the parameters as text variables
Add a few more variables such as instance number, source file name, etc...
At the point where fn was removed, write a formatted string that can include available variables
Append similar formatted strings to one or more external files (e.g. resource files etc...)
I'm guessing the above functionality would be easy enough to implement for someone who is already parsing the source, and will pass a link to this question to Whole Tomato as an enhancement suggestion.
Edit: For anyone interested, there's some follow up on the VA forum here.
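The first few steps of that list (find each call, pull out the parameters as text, write a replacement) can be sketched in Python roughly as follows. This is not the tool described above. It is a text-level approximation that balances brackets and respects quotes, and it will also hit declarations, comments and macros named like the function. The names split_call_args and rewrite_calls, and the ERR_0001 style identifier in the usage, are invented for illustration:

```python
import re

def split_call_args(source, open_paren):
    """Return (args, end) for the call whose '(' is at index open_paren."""
    args, current, depth, quote = [], [], 0, None
    i = open_paren
    while i < len(source):
        ch = source[i]
        if quote:                              # inside a string/char literal
            current.append(ch)
            if ch == '\\' and i + 1 < len(source):
                i += 1
                current.append(source[i])      # keep the escaped character
            elif ch == quote:
                quote = None
        elif ch in '"\'':
            quote = ch
            current.append(ch)
        elif ch in '([{':
            if depth > 0:
                current.append(ch)
            depth += 1
        elif ch in ')]}':
            depth -= 1
            if depth == 0:                     # matching close of the call
                args.append(''.join(current).strip())
                return ([] if args == [''] else args), i + 1
            current.append(ch)
        elif ch == ',' and depth == 1:         # top-level argument separator
            args.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    raise ValueError('unbalanced parentheses')

def rewrite_calls(source, fn, make_replacement):
    """Replace each fn(...) with make_replacement(instance_number, args)."""
    pattern = re.compile(r'\b' + re.escape(fn) + r'\s*\(')
    out, pos, instance = [], 0, 0
    while (m := pattern.search(source, pos)) is not None:
        args, end = split_call_args(source, m.end() - 1)
        instance += 1
        out.append(source[pos:m.start()])
        out.append(make_replacement(instance, args))
        pos = end
    out.append(source[pos:])
    return ''.join(out)

if __name__ == "__main__":
    line = 'LogError(elvl_Note,"Shifted %0.1lf, by %0.3lf",Graph.c1,Offset);'
    print(rewrite_calls(line, 'LogError',
                        lambda n, a: 'LogError(ERR_%04d, %s)' % (n, ', '.join(a[2:]))))
```

The callback receives the instance number and the argument texts, so it is also the natural place to append the extracted format string to an external resource file.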

svn script to rename member variables on checkout/update

I work with a guy who prefers to name his member variables using the mCamelCase convention, for example: mSomeVar or mSomeOtherVar. I can't stand this format. I prefer the m_camelCase convention, for example: m_someVar or m_someOtherVar. We drive each other mad looking at each other's code.
Now we've added a new guy to the team and he prefers no prefix at all. Since we use svn, we were thinking we could develop an svn script that renames member variables on the fly when you download the code from the server. This way everyone can get the member variables named whatever they want.
Does anyone have any example svn scripts that can do this sort of thing? I've seen scripts that change comment headers but we need something that includes a C++ processor.

Sounds like a recipe for disaster. In order to get this to work you'd need to decide on a repository wide standard, where every file uses the same variable naming conventions. If you can do that, why not just have everyone code like that?! The important thing about conventions is not which convention you use but that everyone actually uses the same convention!
Find someone with seniority to make the call (or pull rank, if you can), everyone else will just have to suck it up.

I'll assume you want to change the style for research purposes, and the question is simply hypothetical. Joel S. seems to think that answers which simply say you're doing something wrong aren't good answers at all, so I'll try to give you some avenues to approach your problem.
The closest thing which svn does in terms of transforms is it can change the line ending on check-out, and change it again on check-in. This allows the repository to have a single idea of a line ending, and for different clients to have the files modified to their preferences. While this feature sounds fairly straight forward, a number of people seem to have problems getting it to work correctly. Simply Google 'svn eol-style'.
Since svn does not provide any sort of customizable client side filters, I think it would be safe to assume you are going to need to modify the svn client, and compile it for your own purposes. Perhaps you could submit a patch or an extension back to svn.
So at this point, you should have downloaded the svn source, and be able to get it to compile the client. At this point, turn your attention to libsvn_subr/subst.c. This file contains the routines to translate to and from various formats. Currently it does translation on keyword expansion and eol's.
You simply need to create a new property, maybe called member-variable-style. For files which have this flag set, you can invoke a special transform in the subst.c code. You should be able to track down the references in svn to the transform code by looking at the calls to svn_subst_translate_stream3.
OK. That was the easy part. Now you need to get a function to properly translate your code from one form to another. You can't simply pull the cpp processor out of gcc, because there is no guarantee the code will be valid or compile. You have to do your best job creating lexing rules which hopefully will do the right thing on the right variables. For variables starting with m_ or even m, this is fairly straightforward to do. Unfortunately, for the member of your team who doesn't use a prefix at all, it can be quite a challenge determining what is a member variable in C++. Fortunately, there exists quite a bit of research in the field done by people who create syntax highlighting code. I'd poke around and find some code which does a good job highlighting C++ code.
Lastly, as these transforms could become quite complicated, I'd suggest you have svn shell out to a filter program at this stage. This wouldn't be great for performance, but it would make it much easier to write and debug your filter. You could then write your filter in Perl or the language of your choice. For a good example of using an external filter program, see Squid redirectors.
Good luck!
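For the two prefixed styles, the filter itself could start out as little more than a pair of regex substitutions. Here is a rough Python sketch, assuming members are always spelled mSomeVar or m_someVar. It works on raw text, so it will also rename matching words inside comments and strings, and it cannot do anything for the no-prefix style. The function names are made up for this example:

```python
import re

M_CAMEL = re.compile(r'\bm([A-Z])(\w*)')        # mSomeVar
M_UNDERSCORE = re.compile(r'\bm_([a-z])(\w*)')  # m_someVar

def to_underscore_style(code: str) -> str:
    """mSomeVar -> m_someVar"""
    return M_CAMEL.sub(lambda m: 'm_' + m.group(1).lower() + m.group(2), code)

def to_camel_style(code: str) -> str:
    """m_someVar -> mSomeVar"""
    return M_UNDERSCORE.sub(lambda m: 'm' + m.group(1).upper() + m.group(2), code)

if __name__ == "__main__":
    line = 'int mSomeVar = mSomeOtherVar + max;'
    print(to_underscore_style(line))
```

Whatever you end up with needs to round-trip cleanly in both directions, otherwise every check-in will show spurious changes.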

Subversion doesn't have hooks that work on update - you could have a post-commit hook* that would allow you to convert from one convention to another (repository standard), and you could use a script that you write yourself that does the checkout, and then performs the necessary adjustments, but this would give you false readings on svn diff etc.
My suggestion is to just sit down with your colleagues and agree on a standard. The post-commit hook would still be useful to catch slip-ups though.
*I'm thinking something that sees a commit has occurred, automatically checks out and alters the code to adhere to the repository standard convention, and then commits if necessary. Another option is to have a pre-commit hook that disallows the commit if the code doesn't adhere to the standard.

Need refactoring ideas for Arrow Anti-Pattern

I have inherited a monster.
It is masquerading as a .NET 1.1 application that processes text files that conform to Healthcare Claim Payment (ANSI 835) standards, but it's a monster. The information being processed relates to healthcare claims, EOBs, and reimbursements. These files consist of records that have an identifier in the first few positions and data fields formatted according to the specs for that type of record. Some record ids are Control Segment ids, which delimit groups of records relating to a particular type of transaction.
To process a file, my little monster reads the first record, determines the kind of transaction that is about to take place, then begins to process other records based on what kind of transaction it is currently processing. To do this, it uses a nested if. Since there are a number of record types, there are a number of decisions that need to be made. Each decision involves some processing and 2-3 other decisions that need to be made based on previous decisions. That means the nested if has a lot of nests. That's where my problem lies.
This one nested if is 715 lines long. Yes, that's right. Seven-Hundred-And-Fif-Teen Lines. I'm no code analysis expert, so I downloaded a couple of freeware analysis tools and came up with a McCabe Cyclomatic Complexity rating of 49. They tell me that's a pretty high number. High as in pollen count in the Atlanta area where 100 is the standard for high and the news says "Today's pollen count is 1,523". This is one of the finest examples of the Arrow Anti-Pattern I have ever been privileged to see. At its highest, the indentation goes 15 tabs deep.
My question is, what methods would you suggest to refactor or restructure such a thing?
I have spent some time searching for ideas, but nothing has given me a good foothold. For example, substituting a guard condition for a level is one method. I have only one of those. One nest down, fourteen to go.
Perhaps there is a design pattern that could be helpful. Would Chain of Command be a way to approach this? Keep in mind that it must stay in .NET 1.1.
Thanks for any and all ideas.

I just had some legacy code at work this week that was similar to (although not as dire as) what you are describing.
There is no one thing that will get you out of this. The state machine might be the final form your code takes, but that's not going to help you get there, nor should you decide on such a solution before untangling the mess you already have.
The first step I would take is to write a test for the existing code. This test isn't to show that the code is correct but to make sure you have not broken something when you start refactoring. Get a big wad of data to process, feed it to the monster, and get the output. That's your litmus test. If you can do this with a code coverage tool, you will see what your test does not cover. If you can, construct some artificial records that will also exercise this code, and repeat. Once you feel you have done what you can with this task, the output data becomes your expected result for your test.
Refactoring should not change the behavior of the code. Remember that. This is why you have known input and known output data sets to validate you are not going to break things. This is your safety net.
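The safety net described above is sometimes called a characterization or golden-master test. The real thing would be written for the .NET code, but the idea can be sketched in Python. Here process stands in for the legacy routine, outputs are assumed to be JSON-serializable, and check_golden is a name invented for this example:

```python
import json
from pathlib import Path

def check_golden(process, records, golden_path):
    """Compare process() output against a previously recorded run.

    On the first run the output is recorded and treated as correct.
    Later runs return the indexes of records whose output changed.
    """
    path = Path(golden_path)
    actual = [process(record) for record in records]
    if not path.exists():
        path.write_text(json.dumps(actual, indent=2))
        return []
    expected = json.loads(path.read_text())
    if len(expected) != len(actual):
        raise ValueError('record count changed since the golden run')
    return [i for i, (e, a) in enumerate(zip(expected, actual)) if e != a]

if __name__ == "__main__":
    import os
    import tempfile
    golden = os.path.join(tempfile.mkdtemp(), 'golden.json')
    sample = ['REC one', 'REC two']
    print(check_golden(str.lower, sample, golden))  # first run records output
    print(check_golden(str.upper, sample, golden))  # changed behavior is reported
```

Run it after every small refactoring step. An empty list means the observable behavior is unchanged for the inputs you have.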
Now Refactor!
A couple of things I did that I found useful:
Invert if statements
A huge problem I had was just reading the code when I couldn't find the corresponding else statement, I noticed that a lot of the blocks looked like this
if (someCondition)
{
    100+ lines of code
    {
        ...
    }
}
else
{
    simple statement here
}
By inverting the if I could see the simple case first and then move on to the more complex block knowing what the other one already did. Not a huge change, but it helped me in understanding.
Extract Method
I used this a lot. Take some complex multi-line block, grok it, and shove it aside in its own method. This allowed me to more easily see where there was code duplication.
Now, hopefully, you haven't broken your code (test still passes, right?), and you have more readable and better understood procedural code. Look, it's already improved! But that test you wrote earlier isn't really good enough: it only tells you that you are duplicating the functionality (bugs and all) of the original code, and only for the lines you had coverage on, as I'm sure you will find blocks of code that you can't figure out how to hit or that simply cannot ever be hit (I've seen both in my work).
Now the big changes where all the big name patterns come into play is when you start looking at how you can refactor this in a proper OO fashion. There is more than one way to skin this cat, and it will involve multiple patterns. Not knowing details about the format of these files you're parsing I can only toss around some helpful suggestions that may or may not be the best solutions.
Refactoring to Patterns is a great book for explaining patterns that are helpful in these situations.
You're trying to eat an elephant, and there's no other way to do it but one bite at a time. Good luck.

A state machine seems like the logical place to start, and using WF if you can swing it (sounds like you can't).
You can still implement one without WF; you just have to do it yourself. However, thinking of it as a state machine from the start will probably give you a better implementation than creating a procedural monster that checks internal state on every action.
Diagram out your states, what causes a transition. The actual code to process a record should be factored out, and called when the state executes (if that particular state requires it).
So State1's execute calls your "read a record", then based on that record transitions to another state.
The next state may read multiple records and call record processing instructions, then transition back to State1.

One thing I do in these cases is to use the 'Composed Method' pattern. See Jeremy Miller's Blog Post on this subject. The basic idea is to use the refactoring tools in your IDE to extract small meaningful methods. Once you've done that, you may be able to further refactor and extract meaningful classes.

I would start with uninhibited use of Extract Method. If you don't have it in your current Visual Studio IDE, you can either get a 3rd-party addin, or load your project in a newer VS. (It'll try to upgrade your project, but you will carefully ignore those changes instead of checking them in.)
You said that you have code indented 15 levels. Start about 1/2-way out, and Extract Method. If you can come up with a good name, use it, but if you can't, extract anyway. Split in half again. You're not going for the ideal structure here; you're trying to break the code into pieces that will fit in your brain. My brain is not very big, so I'd keep breaking & breaking until it doesn't hurt any more.
As you go, look for any new long methods that seem to be different from the rest; make these into new classes. Just use a simple class that has only one method for now. Heck, making the method static is fine. Not because you think they're good classes, but because you are so desperate for some organization.
Check in often as you go, so you can checkpoint your work, understand the history later, be ready to do some "real work" without needing to merge, and save your teammates the hassle of hard merging.
Eventually you'll need to go back and make sure the method names are good, that the set of methods you've created make sense, clean up the new classes, etc.
If you have a highly reliable Extract Method tool, you can get away without good automated tests. (I'd trust VS in this, for example.) Otherwise, make sure you're not breaking things, or you'll end up worse than you started: with a program that doesn't work at all.
A pairing partner would be helpful here.

Judging by the description, a state machine might be the best way to deal with it. Have an enum variable to store the current state, and implement the processing as a loop over the records, with a switch or if statements to select the action to take based on the current state and the input data. You can also easily dispatch the work to separate functions based on the state using function pointers, too, if it's getting too bulky.
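As a rough illustration of that shape (an enum for the current state, a loop over the records, and a dispatch table in place of function pointers), here is a Python sketch. The record ids HDR, CLM, SVC and END and the three states are invented for the example and are not taken from the 835 spec:

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    IN_TRANSACTION = auto()
    IN_CLAIM = auto()

def on_idle(record, log):
    if record.startswith('HDR'):
        log.append('begin transaction')
        return State.IN_TRANSACTION
    raise ValueError('unexpected record outside a transaction: ' + record)

def on_transaction(record, log):
    if record.startswith('CLM'):
        log.append('claim ' + record[4:])
        return State.IN_CLAIM
    if record.startswith('END'):
        log.append('end transaction')
        return State.IDLE
    raise ValueError('unexpected record in transaction: ' + record)

def on_claim(record, log):
    if record.startswith('SVC'):
        log.append('service ' + record[4:])
        return State.IN_CLAIM
    # Anything else closes the claim and is handled at transaction level.
    return on_transaction(record, log)

# Dispatch table: one handler per state, each returning the next state.
HANDLERS = {
    State.IDLE: on_idle,
    State.IN_TRANSACTION: on_transaction,
    State.IN_CLAIM: on_claim,
}

def process(records):
    state, log = State.IDLE, []
    for record in records:
        state = HANDLERS[state](record, log)
    return log

if __name__ == "__main__":
    for event in process(['HDR', 'CLM 1', 'SVC a', 'SVC b', 'CLM 2', 'END']):
        print(event)
```

Each handler only knows about the records valid in its own state, which is what flattens the nesting. In .NET 1.1 the table could be a switch on the enum or an array of delegates.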

There was a pretty good blog post about it at Coding Horror. I've only come across this anti-pattern once, and I pretty much just followed his steps.

Sometimes I combine the state pattern with a stack.
It works well for hierarchical structures; a parent element knows what state to push onto the stack to handle a child element, but a child doesn't have to know anything about its parent. In other words, the child doesn't know what the next state is, it simply signals that it is "complete" and gets popped off the stack. This helps to decouple the states from each other by keeping dependencies uni-directional.
It works great for processing XML with a SAX parser (the content handler just pushes and pops states to change its behavior as elements are entered and exited). EDI should lend itself to this approach too.
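A small Python sketch of the push/pop idea, using the standard xml.sax module. The element names claims, claim and service and the state classes are invented for illustration. The point is only that a parent decides which state handles a child, and a child just reports that it is done:

```python
import xml.sax

class State:
    """Default state: ignore everything, including unknown subtrees."""
    def child(self, name):
        return State()
    def text(self, data):
        pass
    def done(self):
        pass

class ServiceState(State):
    def __init__(self, amounts):
        self.amounts, self.buffer = amounts, []
    def text(self, data):
        self.buffer.append(data)          # SAX may deliver text in chunks
    def done(self):
        self.amounts.append(float(''.join(self.buffer)))

class ClaimState(State):
    def __init__(self, totals):
        self.totals, self.amounts = totals, []
    def child(self, name):
        return ServiceState(self.amounts) if name == 'service' else State()
    def done(self):
        self.totals.append(sum(self.amounts))

class ClaimListState(State):
    def __init__(self, totals):
        self.totals = totals
    def child(self, name):
        return ClaimState(self.totals) if name == 'claim' else State()

class DocumentState(State):
    def __init__(self):
        self.totals = []
    def child(self, name):
        return ClaimListState(self.totals) if name == 'claims' else State()

class StackHandler(xml.sax.ContentHandler):
    def __init__(self, root):
        super().__init__()
        self.stack = [root]
    def startElement(self, name, attrs):
        # The parent on top of the stack picks the state for its child.
        self.stack.append(self.stack[-1].child(name))
    def characters(self, content):
        self.stack[-1].text(content)
    def endElement(self, name):
        # The child is complete: pop it and let it hand over its result.
        self.stack.pop().done()

if __name__ == "__main__":
    doc = DocumentState()
    data = (b'<claims><claim><service>10.5</service><service>4.5</service>'
            b'</claim><claim><service>2</service></claim></claims>')
    xml.sax.parseString(data, StackHandler(doc))
    print(doc.totals)
```

For a flat record file the same stack works if the control segments play the role of start and end tags.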
