Variable renaming for plagiarism detection for C/C++

Variable renaming for plagiarism detection for C/C++ - c++

I have a couple of simple C++ homeworks and I know the students shared code. These are smart students and they know how to cheat moss. I'm looking for a tool that can rename variables based on their types (first variable of type int will be int1, first int array will be intptr1...), or does something similar that I cannot think of now. Do you know a quick way to do this?
edit: I'm required to use moss and report 90% match
Thanks

Yep, the tool you're looking for is called a compiler. :)
Seriously, if the programs submitted are exactly the same except for the identifier names, compiling then (without debugging info) should result in exactly the same output.
If you do this with debugging turned on, the compiler may leave meta-data in the executable that is different for each executable, hence the comment about ensuring it is off. This is also why this wont work for Java programs - that kind of info is present whether in debug mode or not (for the purposes of dynamic introspection).
EDIT: I see from the comments added to the question that you're observing some submissions that are different in more than just identifier names. If the programs are still structurally equivalent, this should still work.
EDIT: Given that the use of moss is a requirement, this probably isn't the way to go. I does seem though that moss has some support for comparing assembly - perhaps compiling to assembler and submitting that to moss is an option (depending on what compiler you're using).

You can download and try our C CloneDR duplicate code detector. It finds duplicated code even when the variable names have been changed. Multiple changes in the same chunk are treated as just one; if they rename the varaibles consistenly everywhere, you'll get back a report of "one clone" with the precise variable subsitution.

You can try Copy Paste Detector with ignoreIdentifiers turned on. You can at least use it for a first pass before going to the effort of normalizing names for moss. Or, since the source is available, maybe you can get it to spit out its internal normalization of the code.

Another way of doing this would be to compile the applications and compare their binaries, so your examination is not limited to variable/function name changing.
An HEX editor can help you with that. I just tried ExamDiff (not free $) and I was happy with the result.

Related

How to export function names and variable names using GCC or clang?

I am making a commercial software and I don't want for it to be easily crackable. It is targeted for Linux and I am compiling it using GCC (8.2.1). The problem is that when I compile it, technically anyone can use disassembler like IDA or Binary Ninja to see all functions names. Here is example (you can see function names on left panel):
Is there any way to protect my program from this kind if reverse engineering? Is there any way of exporting all if these function names and variables from code automatically (with GCC or clang?), so I can make a simple script to change them to completely random before compilation?

So you want to hide/mask the names of symbols in your binary. You've decided that, to do this, you need to get a list of them so that you can create a script to modify them. Well, you could get that list with nm but you don't need any of that (rewriting names inside a compiled binary? oof… recipe for disaster).
Instead, just do what everybody does in a release build and strip the symbols! You'll see a much smaller binary, too. Of course this doesn't prevent reverse engineering (nothing does), though it arguably makes said task more difficult.
Honestly you should be stripping your release binaries anyway, and not to prevent cracking. Common wisdom is not to try too hard to prevent cracking, because you'll inevitably fail, and at the cost of wasted dev time in the attempt (and possibly a more complex codebase that's harder to maintain / a more complex executable that is less fast and/or useful for the honest customer).

Changing parts of compiled binaries

learned english as a second lang, sorry for the mistakes & awkwardness
I have given a peculiar project to work on. The company has lost the source code for the app, and I have to make changes to it. Now, reverse engineering the whole thing is impossible for one man, its just too huge, however patching individual functions would be feasible, since the changes are not that monumental.
So, one possible solution would be compiling C code and somehow -after rewriting addresses- patching it into the actual binary, ideally, replacing the code the CALL instruction jumps to, or inserting a JMP to my code.
Is there any way to accomplish this using MingW32? If it is, can you provide a simple example? I'm also interested in books which could help me accomplishing the task.
Thanks for your help

I use OllyDBG for this kind of things. It allows you to see the disassembly and debug it, you can place breakpoints etc, and you can also edit the binary. So, you could edit the PE header of that program adding a code section with your (compiled) code inside, then call it from the original program.
I can't give you any advice since I've never tried, although I thought about it many times. You know, lazyness.. :)

I would disassemble the program with a high-quality disassembler that produces something that can be assembled back into a runnable app, and then replace the parts you need to modify with C code.

Something like this will let you reverse the machine code into source. It won't be pretty but it does work.
http://www.hex-rays.com/idapro/
There are also tools for runtime patching http://www.dyninst.org/ for instance. They really aren't made for patching but they can do the trick.
And of course the last choice is to just use an assembler and write machine code :)

Edit and Continue on GDB

I know that E&C is a controversial subject and some say that it encourages a wrong approach to debugging, but still - I think we can agree that there are numerous cases when it is clearly useful - experimenting with different values of some constants, redesigning GUI parameters on-the-fly to find a good look... You name it.
My question is: Are we ever going to have E&C on GDB? I understand that it is a platform-specific feature and needs some serious cooperation with the compiler, the debugger and the OS (MSVC has this one easy as the compiler and debugger always come in one package), but... It still should be doable. I've even heard something about Apple having it implemented in their version of GCC [citation needed]. And I'd say it is indeed feasible.
Knowing all the hype about MSVC's E&C (my experience says it's the first thing MSVC users mention when asked "why not switch to Eclipse and gcc/gdb"), I'm seriously surprised that after quite some years GCC/GDB still doesn't have such feature. Are there any good reasons for that? Is someone working on it as we speak?

It is a surprisingly non-trivial amount of work, encompassing many design decisions and feature tradeoffs. Consider: you are debugging. The debugee is suspended. Its image in memory contains the object code of the source, and the binary layout of objects, the heap, the stacks. The debugger is inspecting its memory image. It has loaded debug information about the symbols, types, address mappings, pc (ip) to source correspondences. It displays the call stack, data values.
Now you want to allow a particular set of possible edits to the code and/or data, without stopping the debuggee and restarting. The simplest might be to change one line of code to another. Perhaps you recompile that file or just that function or just that line. Now you have to patch the debuggee image to execute that new line of code the next time you step over it or otherwise run through it. How does that work under the hood? What happens if the code is larger than the line of code it replaced? How does it interact with compiler optimizations? Perhaps you can only do this on a specially compiled for EnC debugging target. Perhaps you will constrain possible sites it is legal to EnC. Consider: what happens if you edit a line of code in a function suspended down in the call stack. When the code returns there does it run the original version of the function or the version with your line changed? If the original version, where does that source come from?
Can you add or remove locals? What does that do to the call stack of suspended frames? Of the current function?
Can you change function signatures? Add fields to / remove fields from objects? What about existing instances? What about pending destructors or finalizers? Etc.
There are many, many functionality details to attend to to make any kind of usuable EnC work. Then there are many cross-tools integration issues necessary to provide the infrastructure to power EnC. In particular, it helps to have some kind of repository of debug information that can make available the before- and after-edit debug information and object code to the debugger. For C++, the incrementally updatable debug information in PDBs helps. Incremental linking may help too.
Looking from the MS ecosystem over into the GCC ecosystem, it is easy to imagine the complexity and integration issues across GDB/GCC/binutils, the myriad of targets, some needed EnC specific target abstractions, and the "nice to have but inessential" nature of EnC, are why it has not appeared yet in GDB/GCC.
Happy hacking!
(p.s. It is instructive and inspiring to look at what the Smalltalk-80 interactive programming environment could do. In St80 there was no concept of "restart" -- the image and its object memory were always live, if you edited any aspect of a class you still had to keep running. In such environments object versioning was not a hypothetical.)

I'm not familiar with MSVC's E&C, but GDB has some of the things you've mentioned:
http://sourceware.org/gdb/current/onlinedocs/gdb/Altering.html#Altering
17. Altering Execution
Once you think you have found an error in your program, you might want to find out for certain whether correcting the apparent error would lead to correct results in the rest of the run. You can find the answer by experiment, using the gdb features for altering execution of the program.
For example, you can store new values into variables or memory locations, give your program a signal, restart it at a different address, or even return prematurely from a function.
Assignment: Assignment to variables
Jumping: Continuing at a different address
Signaling: Giving your program a signal
Returning: Returning from a function
Calling: Calling your program's functions
Patching: Patching your program
Compiling and Injecting Code: Compiling and injecting code in GDB

This is a pretty good reference to the old Apple implementation of "fix and continue". It also references other working implementations.
http://sources.redhat.com/ml/gdb/2003-06/msg00500.html
Here is a snippet:
Fix and continue is a feature implemented by many other debuggers,
which we added to our gdb for this release. Sun Workshop, SGI ProDev
WorkShop, Microsoft's Visual Studio, HP's wdb, and Sun's Hotspot Java
VM all provide this feature in one way or another. I based our
implementation on the HP wdb Fix and Continue feature, which they
added a few years back. Although my final implementation follows the
general outlines of the approach they took, there is almost no shared
code between them. Some of this is because of the architectual
differences (both the processor and the ABI), but even more of it is
due to implementation design differences.
Note that this capability may have been removed in a later version of their toolchain.
UPDATE: Dec-21-2012
There is a GDB Roadmap PDF presentation that includes a slide describing "Fix and Continue" among other bullet points. The presentation is dated July-9-2012 so maybe there is hope to have this added at some point. The presentation was part of the GNU Tools Cauldron 2012.
Also, I get it that adding E&C to GDB or anywhere in Linux land is a tough chore with all the different components.
But I don't see E&C as controversial. I remember using it in VB5 and VB6 and it was probably there before that. Also it's been in Office VBA since way back. And it's been in Visual Studio since VS2005. VS2003 was the only one that didn't have it and I remember devs howling about it. They intended to add it back anyway and they did with VS2005 and it's been there since. It works with C#, VB, and also C and C++. It's been in MS core tools for 20+ years, almost continuous (counting VB when it was standalone), and subtracting VS2003. But you could still say they had it in Office VBA during the VS2003 period ;)
And Jetbrains recently added it too their C# tool Rider. They bragged about it (rightly so imo) in their Rider blog.

Are there any lint tools for C and C++ that check formatting?

I have a codebase that is touched by many people. While most people make an effort to keep the code nicely formatted (e.g. consistent indentation and use of braces), some don't, and even those that do can't always do it because we all use different editors, so settings like spaces vs. tabs are different.
Is there any standard lint tool that checks that code is properly formatted, but doesn't actually change it (like indent but that returns only errors and warnings)?
While this question could be answered generally, my focus is on C and C++, because that's what this project is written in.

Google uses cpplint. This is their style guide.

The Linux kernel uses a tool that does exactly this - it's called checkpatch. You'd have to modify it to check your coding standards rather than theirs, but it could be a good basis to work from. (It is also designed for C code rather than C++).

Take a look at Vera++, it has a number of rules already available but the nice part is that you can modify them or write your own.

There are several programs that can do formatting for you automatically on save (such as Eclipse). You can have format settings that everyone can use ensuring the same formatting.
It is also possible to automatically apply such formatting when code is committed. When you use SVN, the system to do this is called svn hooks. This basically starts a program to process (or check and deny) the formatting when a commit happens.
This site explains how you can make your own. But also ones already exist to do this.

Can code formatting lead to change in object file content?

I have run though a code formatting tool to my c++ files. It is supposed to make only formatting changes. Now when I built my code, I see that size of object file for some source files have changed. Since my files are very big and tool has changed almost every line, I dont know whether it has done something disastrous. Now i am worried to check in this code to repo as it might lead to runtime error due to formatting tool. My question is , will the size of object file be changed , if code formatting is changed.?

Brief answer is no:)

I would not check your code into the repo without thoroughly checking it first (review, testing).
Pure formatting changes should not change the object file size, unless you've done a debug build (in which case all bets are off). A release build should be not just the same size, but barring your using __DATE__ and such to insert preprocessor content, it should be byte-for-byte the same as well.
If the "reformatting" tool has actually done some micro-optimizations for you (caching repeated access to invariants in local vars, or undoing your having done that unnecessarily), that might affect the optimization choices the compiler makes, which can have an effect on the object file. But I wouldn't assume that that was the case.

if ##__LINE__ macro is used might produce longer strings. How different are the sizes?
(this macro is often hides in new and assert messages in debug.)

just formatting the code should not change the size of the object file.

It might if you compile with debugging symbols, as it might have added more line number information. Normally it wouldn't though, as has already been pointed out.
Try comparing object files built without debugging symbols.

Try to find a comparison tool that won't care about the formatting changes (like perhaps "diff--ignore-all-space") and check using that before checking in.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js