Comparing the checksums of two executables built from the exact same source - C++

I have a question regarding the verification of executable files, compiled with Visual Studio, using a checksum:
If I build a project from source, I end up with an executable, call it exec1.exe, that has some metadata in it.
If I later rebuild the exact same source, I get another executable, say exec2.exe, which also has its own metadata section.
If I create a checksum for each of the two files, they differ, since the metadata in the two files is different.
Does anyone know of a way to bypass the metadata when computing a checksum of the files, so that regardless of the metadata, the two files produce the same checksum value? Or a way to compile the binaries such that, as long as the source is identical, I end up with identical executables?
Thanks in advance for your input,
Regards

There is no guarantee that Visual C++ will generate the same binary image when building the same source files in successive builds. The checksum is not intended to be used in this way, and after a bit of research it seems that this is difficult to achieve. Instead, resources such as this KB article can help with comparing files.
Checksums are usually used to find errors resulting from sending/storing data, not to compare versions/builds of an executable.

If you have the PDB file as well, you can use the DIA SDK to query all the source files that were used to build the executable. Basically, enumerate all the IDiaSourceFile objects; each IDiaSourceFile has a get_checksum method. You can generate a master checksum that is a combination of the checksums of all the source files that were used to make the executable. If the checksum of any source file has changed, you can reasonably assume that the executable has changed as well.
This is the same mechanism that Visual Studio uses to determine if a source file is in sync with the pdb so that it can be stepped into for debugging purposes.
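A minimal sketch of that approach, assuming the DIA SDK that ships with Visual Studio (dia2.h, diaguids.lib) and a registered msdia DLL; the tool name and the way the per-file checksums are folded into one "master" value are just illustrative placeholders:

#include <dia2.h>
#include <atlbase.h>
#include <cstdio>

int wmain(int argc, wchar_t* argv[])
{
    if (argc < 2) { wprintf(L"usage: pdbsum <file.pdb>\n"); return 1; }
    CoInitialize(NULL);

    CComPtr<IDiaDataSource> source;
    if (FAILED(CoCreateInstance(CLSID_DiaSource, NULL, CLSCTX_INPROC_SERVER,
                                __uuidof(IDiaDataSource), (void**)&source)))
        return 1;
    if (FAILED(source->loadDataFromPdb(argv[1]))) return 1;

    CComPtr<IDiaSession> session;
    if (FAILED(source->openSession(&session))) return 1;

    // NULL compiland and NULL name enumerate every source file known to the PDB.
    CComPtr<IDiaEnumSourceFiles> files;
    if (FAILED(session->findFile(NULL, NULL, nsNone, &files))) return 1;

    // Fold each file's stored checksum into one value (simple mix, for illustration only).
    unsigned long long master = 0;
    CComPtr<IDiaSourceFile> file;
    ULONG fetched = 0;
    while (SUCCEEDED(files->Next(1, &file, &fetched)) && fetched == 1)
    {
        BYTE buf[64];
        DWORD len = 0;
        if (SUCCEEDED(file->get_checksum(sizeof(buf), &len, buf)))
            for (DWORD i = 0; i < len; ++i)
                master = (master << 5) ^ (master >> 59) ^ buf[i];
        file.Release();
    }

    wprintf(L"combined source checksum: %016llx\n", master);
    CoUninitialize();
    return 0;
}

If the combined value differs between two PDBs, at least one contributing source file changed.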

Related

Integrate Google Protocol Buffers .proto files into Visual C++ 2010

I've added a custom build step to my Visual Studio project files which generates the Google protobuf .h/.cc files from the .proto input files. But I've been wondering: is it possible to start a compile only if the content of the .proto files has changed?
Is there a way to tell Visual Studio exactly that from a custom build step? What is the optimal way to integrate .proto files into a Visual Studio solution?
At the moment the .proto file is processed at every build, which updates the time stamps of the output .h/.cc files, which in turn triggers a recompile of everything that depends on them. Is there a better way around this, while still building them directly from Visual Studio?
Follow these detailed instructions to specify the Custom Build Tool.
Assuming your .proto file resides together with the .h/.cpp files in a standard project configuration, here are the values to insert in the Custom Build Tool settings:
Command Line:
path\to\protoc --proto_path=$(ProjectDir) --cpp_out=$(ProjectDir) %(FullPath)
Outputs:
$(ProjectDir)%(Filename).pb.h;$(ProjectDir)%(Filename).pb.cc
Note the use of item metadata macros, which replace some of the deprecated macros (such as $(InputDir) and $(InputName)).
Now the Protocol Buffers compiler will be run only when the input file (i.e. %(FullPath)) is newer than the files listed in "Outputs".
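If you prefer to edit the project file directly, those same settings end up as metadata on a CustomBuild item in the .vcxproj. A rough sketch of what that looks like (the messages.proto file name and the protoc path are placeholders):

<ItemGroup>
  <CustomBuild Include="messages.proto">
    <Command>path\to\protoc --proto_path=$(ProjectDir) --cpp_out=$(ProjectDir) %(FullPath)</Command>
    <Outputs>$(ProjectDir)%(Filename).pb.h;$(ProjectDir)%(Filename).pb.cc</Outputs>
    <Message>Running protoc on %(Filename)%(Extension)</Message>
  </CustomBuild>
</ItemGroup>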
Maybe this helps. In particular, look at the post by Igor Zavoychinskiy:
The solution to this nasty problem is actually simple: in the Outputs section you should specify full path(s). This isn't explicitly stated anywhere, but without it the checker simply fails to find the files and hence assumes they don't exist. For example, for compiling protobuffers the Outputs section would look like this:
$(InputDir)\$(InputName).pb.cc;$(InputDir)\$(InputName).pb.h
and (maybe?) kmote00:
...
Bottom line: I just had to make sure my "Outputs" entry exactly matched the Default Value in the (user-defined) "OutputFile" property. (Thankfully this also obviated the need for a two-pass build, which was another annoyance I had previously put up with.)

Committing a TLB file to repository

I'm importing a TLB file into my project since I'm using a COM DLL. A TLB file is a binary file that I need in order to compile my source code, so I was wondering whether it's good programming practice to commit it to the repository.
Yes, it's OK to put binary files in a source repository. The rule sometimes stated as 'do not put binary files in a source repository' would be better stated as 'do not put temporary files, or files that are a compilation result, in a source repository'. Basically, anything that can't be produced from other files and is relevant to the project itself (i.e. not editor preference files) can go into a repository.
A type library is normally created by midl.exe from an interface definition language (IDL) source file, or by a utility like Tlbexp.exe or Regasm.exe, which can generate a type library from a .NET assembly. If you don't have the source for the type library, then there's little else you can do but check in the .tlb. Note that a type library is very commonly embedded as a resource in the COM server, so checking in the binaries is an option too.
Note that it is technically possible to reverse engineer the IDL from the type library with the Oleview.exe File + View TypeLib command. Not so sure that's useful, though, when you don't actually control the source.
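For what it's worth, if you do end up with the IDL (either original or recovered via Oleview), the .tlb can be regenerated at build time instead of being committed, and then consumed from C++ with #import. File names below are placeholders:

midl /tlb MyLib.tlb MyLib.idl

and in the consuming C++ source:

#import "MyLib.tlb" no_namespace named_guids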

Combine multiple DLLs into one

I'm wondering if it's possible to combine multiple DLLs into one. I'm currently working on a C++ project that depends on many dynamic link libraries, so would it be possible to combine them into a single DLL file, and if so, how would I do that?
I do have the source code for these DLLs, yes.
Just combine all the source files from all the DLL projects into a single DLL project?
And if you have multiple *.def files (one for each project) then combine them into a single *.def file.
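A combined module-definition file is essentially just the union of the EXPORTS lists; a tiny illustrative sketch (library and function names are made up):

LIBRARY CombinedLib
EXPORTS
    FunctionFromFirstDll
    FunctionFromSecondDll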
Realistically, no. In theory, if you wanted to badly enough you could do something like disassembling all of them, then re-assembling all the separate files into object files, then re-linking those object files into one big DLL. Getting this to actually work would usually be non-trivial though -- there are likely to be things like conflicting symbol names that would require considerable work to get around.
A rather cleaner possibility would be to package all the DLLs into a zip file (or whatever you prefer) and have a small program to unzip them to a temporary directory, run the main program, and then erase the DLLs from that directory. This has a few problems of its own though (e.g., leaving copies of the files if the machine crashes/loses power/whatever during a run).
Edit: Since you have the source code, using it to build all the code into a single DLL is much more reasonable. For the most part, it's just a matter of adding all the source files to a single project that creates one DLL as its output. You may (easily) run into some symbol conflicts. Given access to the source code, the obvious way to deal with this would be by putting things into namespaces.
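For example, if two of the original DLLs defined a function with the same name, wrapping each DLL's sources in its own namespace keeps them apart once merged (names below are made up):

// Formerly dllA.cpp and dllB.cpp, now both compiled into the single merged DLL project.
namespace dll_a { int Frobnicate(int x) { return x + 1; } }
namespace dll_b { int Frobnicate(int x) { return x * 2; } } // same name, no clash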
It's certainly not infeasible. The DLL format contains all the information you need to merge the code and data from multiple DLLs into one and rebase the resulting code.
This is not a standard feature of any toolchain I can think of, though.

What is the difference between compiled code and executable code?

I always use the terms compile and build interchangeably.
What exactly do these terms stand for?
Compiling is the act of turning source code into object code.
Linking is the act of combining object code with libraries into a raw executable.
Building is the sequence composed of compiling and linking, with possibly other tasks such as installer creation.
Many compilers handle the linking step automatically after compiling source code.
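With the Visual C++ command-line tools, for instance, the two steps can be run separately (a hedged illustration; file names are placeholders):

cl /c main.cpp util.cpp
link main.obj util.obj /OUT:app.exe

The first line compiles only, producing main.obj and util.obj; the second links those object files, together with the default libraries, into app.exe. Running cl main.cpp util.cpp without /c performs both steps at once, which is part of why the two terms blur together in everyday use.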
From wikipedia:
In the field of computer software, the term software build refers either to the process of converting source code files into standalone software artifact(s) that can be run on a computer, or the result of doing so. One of the most important steps of a software build is the compilation process where source code files are converted into executable code.
While for simple programs the process consists of a single file being compiled, for complex software the source code may consist of many files and may be combined in different ways to produce many different versions.
A build could be seen as a script that comprises many steps, the primary one being compiling the code.
Others could include:
running tests
reporting (e.g. coverage)
static analysis
pre and post-build steps
running custom tools over certain files
creating installs
labelling them and deploying/copying them to a repository
They are often used to mean the same thing. However, "build" may also mean the full process of compiling and linking a whole application (in the case of, e.g., C and C++), or even more, including, among other things:
packaging
automatic (unit and/or integration) testing
installer generation
installation/deployment
documentation/site generation
report generation (e.g. test results, coverage).
There are systems like Maven, which generalize this with the concept of lifecycle, which consists of several stages, producing different artifacts, possibly using results and artifacts from previous stages.
In my experience, I would say that "compiling" refers to the conversion of one or several human-readable source files into object code (object files in C), while "building" denotes the whole process of compiling, linking, and whatever else needs to be done for an entire package or project.
Most people would probably use the terms interchangeably.
You could see one nuance: compiling is only the step where you pass some source files through the compiler (gcc, javac, whatever).
Building could be understood as the more general process of checking out the source, creating a target folder for the compiled artifacts, checking dependencies, choosing what has to be compiled, running automated tests, creating a tar/zip/distribution, pushing to an FTP server, etc.

Repeatable object code generation c++

When I build a project using a C++ compiler, can I make sure that the produced binary is not affected if there were no changes to the source code? It looks like every time I recompile my source, the binary's MD5 checksum changes. Is the time of compilation somehow affecting the binary produced? How can I produce compilation results which are repeatable?
One can disassemble the binaries and run md5 on the output.
Example on Mac OS X:
otool -tV a.out | md5
ee2e724434a89fce96aa6b48621f7220
But this misses out on the global data... (there might be a parameter to include that too).
I'm only addressing the problem of md5-checking a binary here; how you manage your sources and build system, as others have written, is also worth looking at.
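A rough Windows analogue of the otool approach, assuming the Visual Studio tools and certutil are available on the PATH (certutil cannot hash standard input, hence the temporary file), with the same caveat that only the disassembled code is covered:

dumpbin /disasm app.exe > app.asm
certutil -hashfile app.asm MD5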
I suspect it will heavily depend on your toolchain and OS. For example, if one of the executable headers contains a timestamp, then you're always going to find that the resulting MD5 is different.
What's the end result you're trying to achieve (i.e. why is it so important that they're identical)?
You can't do an MD5 checksum comparison for Visual Studio builds. For a normal Release .exe file from Visual Studio there will be three locations that change with each recompile. Two of them are timestamps and the third is a unique GUID that Visual Studio uses to match versions of the .exe with its helper files to ensure they are in sync.
It might be possible to write a tool that will zero out the three changing fields, but I'm not sure how easy it would be to parse the file.
Also, if you are calling any DLLs, if I recall correctly, you will get more unique identifiers in the generated file.
The Debug version is a different story. I think there are many, many more differences.
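A minimal sketch of the "zero out the changing fields" idea mentioned above, assuming a standard PE layout: it clears only the COFF header TimeDateStamp in place, and deliberately ignores the debug-directory timestamp/GUID, which a real tool would also have to handle before checksums could match.

#include <windows.h>
#include <cstddef>
#include <cstdio>

// Zero the COFF TimeDateStamp so two otherwise identical builds hash the same.
bool ZeroCoffTimestamp(const wchar_t* path)
{
    HANDLE file = CreateFileW(path, GENERIC_READ | GENERIC_WRITE, 0, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return false;

    IMAGE_DOS_HEADER dos = {};
    DWORD count = 0;
    if (!ReadFile(file, &dos, sizeof(dos), &count, NULL) ||
        count != sizeof(dos) || dos.e_magic != IMAGE_DOS_SIGNATURE)
    {
        CloseHandle(file);
        return false;
    }

    // e_lfanew points at the "PE\0\0" signature; the IMAGE_FILE_HEADER follows it.
    LONG offset = dos.e_lfanew + sizeof(DWORD) + offsetof(IMAGE_FILE_HEADER, TimeDateStamp);
    SetFilePointer(file, offset, NULL, FILE_BEGIN);

    DWORD zero = 0, written = 0;
    BOOL ok = WriteFile(file, &zero, sizeof(zero), &written, NULL);
    CloseHandle(file);
    return ok && written == sizeof(zero);
}

int wmain(int argc, wchar_t* argv[])
{
    if (argc < 2) { wprintf(L"usage: zerots <file.exe>\n"); return 1; }
    return ZeroCoffTimestamp(argv[1]) ? 0 : 1;
}

Run it on copies of both executables before hashing them, not on the originals, since it modifies the file in place.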
Use an incremental build system, such as make, to ensure you don't recompile your code if the source doesn't change.
It may or may not be possible to get your compiler to produce identical binaries from the same source; it depends on the compiler. Most will embed the current time somewhere in the generated binary.