Reverse engineering your own code c++

Reverse engineering your own code c++ - c++

I have a compiled program which I want to know if a certain line exist in it. Is there a way, using my source code, I could determine that?
Tony commented on my message so I'll add some info:
I'm using the g++ compiler.
I'm compiling the code on Linux(Scientific)/Unix machine
I only use standard library (nothing downloaded from the web)
The desired line is either multiplication by a number (in a subfunction of a while group) or printing a line in a specific case (if statement)
I need this becouse I'm running several MD simulations and sometimes I find my self in a situation where I'm not sure of the conditions.

objdump is a utility that can be used as a disassembler to view executable in assembly form.
Use this command to disassemble a binary,
objdump -Dslx file
Important to note though that disassemblers make use of the symbolic debugging information present in object files(ELF), So that information should be present in your object files. Also, constants & comments in source code will not be a part of the disassembled output.

Summary
Use source code control and keep track of which source code revision the executable's built from... it should write that into the output so you can always cross-reference the two, checkout the same sources and rebuild the executable that gave you those results etc..
Discussion
The desired line is either multiplication by a number (in a subfunction of a while group) or printing a line in a specific case (if statement)
I need this becouse I'm running several MD simulations and sometimes I find my self in a situation where I'm not sure of the conditions.
For the very simplest case where you want all the MD simulations to be running the latest source, you can compare timestamps on the source files with the executable to see if you forgot to recompile, compare the process start time (e.g. as listed by ps) with the executable creation time.
Where you're deliberately deploying multiple versions of the program and only have the latest source, then it gets pretty tricky. A multiplication will typically only generate a single machine code instruction... unless you have some contextual insight you're unlikely to know which multiplication is significant (or if it's missing). The compiler may generate its own multiplications for e.g. array indexing, and may sometimes optimise multiplications into bit shifts (or nothing, as Ira comments), so it's not as simple as saying 'well, it's my only multiplication in function "X"'. If you're printing a specific line that may be easier to distinguish... if there's a unique string literal you can search for it in the executable (e.g. puts("Hello") -> strings program | grep Hello, though that may get other matches too, and the compiler's allowed to reuse string literal sequences so "Well Hello" might cater to your need via a pointer to 'H' too). If there's a new extern symbol involved you might see it in nm output etc..
All that said (woah)... you should do something altogether different really. Best is to use a source control system (e.g. svn, cvs...), and get it configured so you can do something to find out which revision of the codebase was used to create the executable - it should be a FAQ for any revision control system.
Failing that, you could, for example, do something to print out what multipliers or conditions the progarm was using when it starts running, capturing that in your logs. While hackish, macros allow you to "stringify" their parameters, so you can log and execute something without typing all the code twice. Lots of other options too.
Hope some of that helps....

Related

How to export function names and variable names using GCC or clang?

I am making a commercial software and I don't want for it to be easily crackable. It is targeted for Linux and I am compiling it using GCC (8.2.1). The problem is that when I compile it, technically anyone can use disassembler like IDA or Binary Ninja to see all functions names. Here is example (you can see function names on left panel):
Is there any way to protect my program from this kind if reverse engineering? Is there any way of exporting all if these function names and variables from code automatically (with GCC or clang?), so I can make a simple script to change them to completely random before compilation?

So you want to hide/mask the names of symbols in your binary. You've decided that, to do this, you need to get a list of them so that you can create a script to modify them. Well, you could get that list with nm but you don't need any of that (rewriting names inside a compiled binary? oof… recipe for disaster).
Instead, just do what everybody does in a release build and strip the symbols! You'll see a much smaller binary, too. Of course this doesn't prevent reverse engineering (nothing does), though it arguably makes said task more difficult.
Honestly you should be stripping your release binaries anyway, and not to prevent cracking. Common wisdom is not to try too hard to prevent cracking, because you'll inevitably fail, and at the cost of wasted dev time in the attempt (and possibly a more complex codebase that's harder to maintain / a more complex executable that is less fast and/or useful for the honest customer).

Checking library integrity

I created a script to remove useless code in many c++ libs (like ifdefs, comments, etc.)
Now, I want to compare the original lib and the "treated" lib to check if my script has done a good job.
The only solution I found is to compare the exported symbols.
I'm wondering if you have any other ideas to check the integrity?

FIRST of all: Unit tests are designed for this purpose.
You might get some mileage out of
compiling without optimization (-O0) and without debug information (or strip it afterwards)
objdump -dCS
and compare the disassemblies. Prepare to meet some / many spurious errors (the strip step was there to prevent needless differences in source line number info). In particular you will have to
ignore addresses
ignore generated label names
But if the transformation would really lead to unmodified code, you'd be able to verify it 1:1 using this technique and a little work.

assert based unit test would help you. Have some test cases , run them against the original library and then run with the code removed .

How do I tell gcov to ignore un-hittable lines of C++ code?

I'm using gcov to measure coverage in my C++ code. I'd like to get to 100% coverage, but am hampered by the fact that there are some lines of code that are theoretically un-hittable (methods that are required to be implemented but which are never called, default branches of switch statements, etc.). Each of these branches contains an assert( false ); statement, but gcov still marks them as un-hit.
I'd like to be able to tell gcov to ignore these branches. Is there any way to give gcov that information -- by annotating the source code, or by any other mechanism?

Please use lcov. It hides gcov's complexity, produces nice output, allows detailed output per test, features easy file filtering and - ta-taa - line markers for already reviewed lines:
From geninfo(1):
The following markers are recognized by geninfo:
LCOV_EXCL_LINE
Lines containing this marker will be excluded.
LCOV_EXCL_START
Marks the beginning of an excluded section. The current line is part of this section.
LCOV_EXCL_STOP
Marks the end of an excluded section. The current line not part of this section.

A tool called gcovr can be used to summarise the output of gcov, and (from at least version 3.4) it supports the same exclusion markers as lcov.
From this answer:
The following markers are recognized by geninfo:
LCOV_EXCL_LINE
Lines containing this marker will be excluded.
LCOV_EXCL_START
Marks the beginning of an excluded section. The current line is part of this section.
LCOV_EXCL_STOP
Marks the end of an excluded section. The current line not part of this section.
You can also replace 'LCOV' above with 'GCOV' or 'GCOVR'. They all work.

Could you introduce unit tests of the relevant functions, that exist solely to shut gcov up by directly attacking the theoretically-unhittable code paths? Since they're unit tests, they could perhaps ignore the "impossibility" of the situations. They could call the functions that are never called, pass invalid enum values to catch default branches, etc.
Then either run those tests only on the version of your code compiled with NDEBUG, or else run them in a harness which tests that the assert is triggered - whatever your test framework supports.
I find it a bit odd though for the spec to say that the code has to be there, rather than the spec containing functional requirements on the code. In particular, it means that your tests aren't testing those requirements, which is as good a reason as any to keep requirements functional. Personally I'd want to modify the spec to say, "if called with an invalid enum value, the function shall fail an assert. Callers shall not call the function with an invalid enum value in release mode". Or some such.
Presumably what it currently says, is along the lines of "all switch statements must have a default case". But that means coding standards are interfering with observable behaviour (at least, observable under gcov) by introducing dead code. Coding standards shouldn't do that, so the functional spec should take account of the coding standards if possible.
Failing that, you could perhaps wrap the unhittable code in #if !GCOV_BUILD, and do a separate build for gcov's benefit. This build will fail some requirements, but conditional on your analysis of the code being correct, it gives you the confidence you want that the test suite tests everything else.
Edit: you say you're using a dodgy code generator, but you're also asking for a solution by annotating the source code. If you're changing the source, can you just remove the dead code in many cases? Not that changing generated source is ideal, but needs must...

I do not believe this is possible. Gcov depends on gcc to generate extra code to produce the coverage output. GCov itself just parses the data. This means that Gcov cannot analyze the code any better than gcc (and I assume you use -Wall and have removed code reported as unreachable).
Remember that relocatable functions can be called from anywhere, potentially even external dlls or executables so there is no way the compiler can know what relocatable functions will not be called or what input these functions may have.
You probably will need to use some facy static analysis tool to get the info that you want.

Show Delphi And C++ Source Code

How can I see the source code of an executable compiled by Delphi or C++?
Please help me.
After Edit:
I have a program. When I start this program, it shows a dialog and asks for a password. This password is saved in source code. I want to take this password quickly and easily.

You can't.
An enormous amount of information is thrown away when the compiler reduces human readable text source code down to machine executable code. Local variables don't need names in machine code, for example, they're just register bits in the instruction opcode.
This is why debugging a compiled executable to step through the original source files line by line can only be done if you have the compiler debug symbols to go with the executable.
There are utilities that attempt to reverse engineer machine code into source code, but the result is less readable to humans than the original machine code, in my opinion. Machine generated function names, machine generated local variables and arguments, and many times the utility has to guess as to the exact data types of arguments and local vars. (is this arg a signed int or an unsigned int? Hard to tell when it's just a stack slot or machine register)
Compiling to an intermediate representation, as is done in Java and .NET, provides for much more reversibility because the types and symbol names of much of the original code are retained. Reflector, for example, can emit C# source code that is very close to the original human written source code.

You can't. The compiler takes the source code and turns it into machine instructions leaving 'no trace' of the original source code behind.
There are programs called de-compilers, but they just basically automate reverse-engineering, they can't actually access the original source code because that's long gone.

by using a disassembler or decompiler. You can't ever get the original source code back from a binary though. That information is lost.

How Can I See a Source Code of Executive File Compiled By Delphi or C++?
You can't, because source code does not exist in compiled Delphi/C++ program.
I Have a Program.When I Start This Program,Show a Dialog And Ask a Password.This Password Saved in Source Code.I Want take This Password Quickly And Easily.
Trying to crack something, huh?
It is quite possible that password is not saved in source code. Hash function can be used on a password to check if it is valid without storing password in a source code. Even if you find a hash, it won't be easy to get a password from it.
You can get an assembler listing from program using a disassembler (Ida Pro, OllyDBG, or similar tool). And you could debug your program even without source code, although you'll see pure assembly. AFAIK, "decompilers" exist, but I haven't ever used one of them, and doubt that they will be useful for C++/Delphi code (the one that compiles into native application).
There are a few simple techniques that would allow to hack program and bypass password check (if some conditions are met, program author wasn't into security, protection is easy, etc), but I'm not sure if this is allowed discussion topic on stackoverflow.
Anyway, if you're interested in reverse engineering for legal purposes, you could try a book called "Reversing: Secrets of Reverse Engineering".

When you say "executive" do you mean "executable"? If so, decompiling will only get you assembly. Some decompilers will try to turn the assembly into a more readable form, but there's no general way to get the source code from an exe unless you actually compile the source code into the file.

First off, the password is not saved in the source code. The compilation process is one-way only; the finished product isn't going to go altering its source. (Or its binary, for that matter, in most cases at least.) The password is most likely saved in a data file someplace. And if the program's author is at all competent, the password is hashed or encrypted in some way. Decompiling the program won't help you much.
Also, as InsertNickHere mentioned, we're not a hacking site here. We're honorable coders helping each other out with the complexities involved in building legitimate software. Please take your shady questions elsewhere.

What do C and Assembler actually compile to? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
So I found out that C(++) programs actually don't compile to plain "binary" (I may have gotten some things wrong here, in that case I'm sorry :D) but to a range of things (symbol table, os-related stuff,...) but...
Does assembler "compile" to pure binary? That means no extra stuff besides resources like predefined strings, etc.
If C compiles to something else than plain binary, how can that small assembler bootloader just copy the instructions from the HDD to memory and execute them? I mean if the OS kernel, which is probably written in C, compiles to something different than plain binary - how does the bootloader handle it?
edit: I know that assembler doesn't "compile" because it only has your machine's instruction set - I didn't find a good word for what assembler "assembles" to. If you have one, leave it here as comment and I'll change it.

C typically compiles to assembler, just because that makes life easy for the poor compiler writer.
Assembly code always assembles (not "compiles") to relocatable object code. You can think of this as binary machine code and binary data, but with lots of decoration and metadata. The key parts are:
Code and data appear in named "sections".
Relocatable object files may include definitions of labels, which refer to locations within the sections.
Relocatable object files may include "holes" that are to be filled with the values of labels defined elsewhere. The official name for such a hole is a relocation entry.
For example, if you compile and assemble (but don't link) this program
int main () { printf("Hello, world\n"); }
you are likely to wind up with a relocatable object file with
A text section containing the machine code for main
A label definition for main which points to the beginning of the text section
A rodata (read-only data) section containing the bytes of the string literal "Hello, world\n"
A relocation entry that depends on printf and that points to a "hole" in a call instruction in the middle of a text section.
If you are on a Unix system a relocatable object file is generally called a .o file, as in hello.o, and you can explore the label definitions and uses with a simple tool called nm, and you can get more detailed information from a somewhat more complicated tool called objdump.
I teach a class that covers these topics, and I have students write an assembler and linker, which takes a couple of weeks, but when they've done that most of them have a pretty good handle on relocatable object code. It's not such an easy thing.

Let's take a C program.
When you run gcc, clang, or 'cl' on the c program, it will go through these stages:
Preprocessor (#include, #ifdef, trigraph analysis, encoding translations, comment management, macros...) including lexing into preprocessor tokens and eventually resulting in flat text for input to the compiler proper.
Lexical analysis (producing tokens and lexical errors).
Syntactical analysis (producing a parse tree and syntactical errors).
Semantic analysis (producing a symbol table, scoping information and scoping/typing errors) Also data-flow, transforming the program logic into an "intermediate representation" that the optimizer can work with. (Often an SSA). clang/LLVM uses LLVM-IR, gcc uses GIMPLE then RTL.
Optimization of the program logic, including constant propagation, inlining, hoisting invariants out of loops, auto-vectorization, and many many other things. (Most of the code for a widely-used modern compiler is optimization passes.) Transforming through intermediate representations is just part of how some compilers work, making it impossible / meaningless to "disable all optimizations"
Outputing into assembly source (or another intermediate format like .NET IL bytecode)
Assembling of the assembly into some binary object format.
Linking of the assembly into whatever static libraries are needed, as well as relocating it if needed.
Output of final executable in elf, PE/coff, MachO64, or whatever other format
In practice, some of these steps may be done at the same time, but this is the logical order. Most compilers have options to stop after any given step (e.g. preprocess or asm), including dumping internal representation between optimization passes for open-source compilers like GCC. (-ftree-dump-...)
Note that there's a 'container' of elf or coff format around the actual executable binary, unless it's a DOS .com executable
You will find that a book on compilers(I recommend the Dragon book, the standard introductory book in the field) will have all the information you need and more.
As Marco commented, linking and loading is a large area and the Dragon book more or less stops at the output of the executable binary. To actually go from there to running on an operating system is a decently complex process, which Levine in Linkers and Loaders covers.
I've wiki'd this answer to let people tweak any errors/add information.

There are different phases in translating C++ into a binary executable. The language specification does not explicitly state the translation phases. However, I will describe the common translation phases.
Source C++ To Assembly or Itermediate Language
Some compilers actually translate the C++ code into an assembly language or an intermediate language. This is not a required phase, but helpful in debugging and optimizations.
Assembly To Object Code
The next common step is to translate Assembly language into an Object code. The object code contains assembly code with relative addresses and open references to external subroutines (methods or functions). In general, the translator puts in as much information into an object file as it can, everything else is unresolved.
Linking Object Code(s)
The linking phase combines one or more object codes, resolves references and eliminates duplicate subroutines. The final output is an executable file. This file contains information for the operating system and relative addresses.
Executing Binary Files
The Operating System loads the executable file, usually from a hard drive, and places it into memory. The OS may convert relative addresses into physical locations. The OS may also prepare resources (such as DLLs and GUI widgets) that are required by the executable (which may be stated in the Executable file).
Compiling Directly To Binary
Some compilers, such as the ones used in Embedded Systems, have the capability to compile from C++ directly to an executable binary code. This code will have physical addresses instead of relative address and not require an OS to load.
Advantages
One of the advantages of these phases is that C++ programs can be broken into pieces, compiled individually and linked at a later time. They can even be linked with pieces from other developers (a.k.a. libraries). This allows developers to only compiler pieces in development and link in pieces that are already validated. In general, the translation from C++ to object is the time consuming part of the process. Also, a person doesn't want to wait for all the phases to complete when there is an error in the source code.
Keep an open mind and always expect the Third Alternative (Option).

To answer your questions, please note that this is subjective as there are different processors, different platforms, different assemblers and C compilers, in this case, I will talk about the Intel x86 platform.
Assemblers do not usually assemble to pure / flat binary (raw machine code), instead usually to a file defined with segments such as data, text and bss to name but a few; this is called an object file. The Linker steps in and adjusts the segments to make it executable, that is, ready to run. Incidentally, the default output when you assemble using GNU as foo.s is a.out, that is a shorthand for Assembler Output. (But the same filename is the gcc default for linker output, with the assembler output being only a temporary.)
Boot loaders have a special directive defined, back in the days of DOS, it would be common to find a directive such as .Org 100h, which defines the assembler code to be of the old .COM variety before .EXE took over in popularity. Also, you did not need to have a assembler to produce a .COM file, using the old debug.exe that came with MSDOS, did the trick for small simple programs, the .COM files did not need a linker and were straight ready-to-run binary format. Here's a simple session using DEBUG.
1:*a 0100
2:* mov AH,07
3:* int 21
4:* cmp AL,00
5:* jnz 010c
6:* mov AH,07
7:* int 21
8:* mov AH,4C
9:* int 21
10:*
11:*r CX
12:*10
13:*n respond.com
14:*w
15:*q
This produces a ready-to-run .COM program called 'respond.com' that waits for a keystroke and not echo it to the screen. Notice, the beginning, the usage of 'a 100h' which shows that the Instruction pointer starts at 100h which is the feature of a .COM. This old script was mainly used in batch files waiting for a response and not echo it. The original script can be found here.
Again, in the case of boot loaders, they are converted to a binary format, there was a program that used to come with DOS, called EXE2BIN. That was the job of converting the raw object code into a format that can be copied on to a bootable disk for booting. Remember no linker is run against the assembled code, as the linker is for the runtime environment and sets up the code to make it runnable and executable.
The BIOS when booting, expects code to be at segment:offset, 0x7c00, if my memory serves me correct, the code (after being EXE2BIN'd), will start executing, then the bootloader relocates itself lower down in memory and continue loading by issuing int 0x13 to read from the disk, switch on the A20 gate, enable the DMA, switch onto protected mode as the BIOS is in 16bit mode, then the data read from the disk is loaded into memory, then the bootloader issues a far jump into the data code (likely to be written in C). That is in essence how the system boots.
Ok, the previous paragraph sounds abstracted and simple, I may have missed out something, but that is how it is in a nutshell.

To answer the assembly part of the question, assembly doesn't compile to binary as I understand it. Assembly === binary. It directly translates. Each assembly operation has a binary string that directly matches it. Each operation has a binary code, and each register variable has a binary address.
That is, unless Assembler != Assembly and I'm misunderstanding your question.

They compile to a file in a specific format (COFF for Windows, etc), composed of headers and segments, some of which have "plain binary" op codes. Assemblers and compilers (such as C) create the same sort of output. Some formats, such as the old *.COM files, had no headers, but still had certain assumptions (such as where in memory it would get loaded or how big it could be).
On Windows machines, the OS's boostrapper is in a disk sector loaded by the BIOS, where both of these are "plain". Once the OS has loaded its loader, it can read files that have headers and segments.
Does that help?

There are two things that you may mix here. Generally there are two topics:
Executable File Formats (see a list here), for example COFF, XCOFF, ELF
Intermediate Languages, like CIL or GIMPLE or bytecode
The latter may compile to the former in the process of assembly. Some intermediate formats are not assembled, but executed by a virtual machine. In case of C++ it may be compiled into CIL, which is assembled into a .NET assembly, hence there me be some confusion.
But in general C and C++ are usually compiled into binary, or in other words, into a executable file format.

You have a lot of answers to read through, but I think I can keep this succinct.
"Binary code" refers to the bits that feed through the microprocessor's circuits. The microprocessor loads each instruction from memory in sequence, doing whatever they say. Different processor families have different formats for instructions: x86, ARM, PowerPC, etc. You point the processor at the instruction you want by giving it the address of the instruction in memory, and then it chugs merrily along through the rest of the program.
When you want to load a program into the processor, you first have to make the binary code accessible in memory so it has an address in the first place. The C compiler outputs a file in the filesystem, which has to be loaded into a new virtual address space. Therefore, in addition to binary code, that file has to include the information that it has binary code, and what its address space should look like.
A bootloader has different requirements, so its file format might be different. But the idea is the same: binary code is always a payload in a larger file format, which includes at a minimum a sanity check to ensure that it's written in the correct instruction set.
C compilers and assemblers are typically configured to produce static library files. For embedded applications, you're more likely to find a compiler which produces something like a raw memory image with instructions beginning at address zero. Otherwise, you can write a linker which converts the output of the C compiler into whatever else you want.

As I understand it, a chipset (CPU, etc.) will have a set of registers for storing data, and understand a set of instructions for manipulating these registers. The instructions will be things like 'store this value to this register', 'move this value', or 'compare these two values'. These instructions are often expressed in short human-grokable alphabetic codes (assembly language, or assembler) which are mapped to the numbers that the chipset understands - those numbers are presented to the chip in binary (machine code.)
Those codes are the lowest level that the software gets down to. Going deeper than that gets into the architecture of the actual chip, which is something I haven't gotten involved in.

The executable files (PE format on windows) cannot be used to boot the computer because the PE loader is not in memory.
The way bootstrapping works is that the master boot record on the disk contains a blob of a few hundred bytes of code. The BIOS of the computer (in ROM on the motherboard) loads this blob into memory and sets the CPU instruction pointer to the beginning of this boot code.
The boot code then loads a "second stage" loader, on Windows called NTLDR (no extension) from the root directory. This is raw machine code that, like the MBR loader, is loaded into memory cold and executed.
NTLDR has the full capability to load PE files including DLLs and drivers.

С(++) (unmanaged) really compiles to plain binary. Some OS-related stuff - are BIOS and OS function calls, they're different for each OS, but still binary.
1. Assembler compiles to pure binary, but, as strange as it gets, it is less optimized than C(++)
2. OS kernel, as well as bootloader, also written in C, so no problems here.
Java, Managed C++, and other .NET stuff, compiles into some pseudocode (MSIL in .NET), which makes it cross-OS and cross-platform, but requires local interpreter or translator to run.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js