Binary Reproducibility in Visual C++

Is there a way to force the same code to produce the same binary in Visual C++? In other words, can you turn off the timestamp in the PE, or force it to some fixed value?

It's not only a timestamp - there's an embedded GUID used for PDB matching - as John Robbins explains.
Even beyond that, there's just no way to force the compiler to generate consistent results, as Jim Griesmer explains -
compiler writers are far more interested in generating correctly functioning code and generating it quickly than ensuring that whatever is generated is laid out identically on your hard drive. Due to the numerous and varied methods and implementations for optimizing code, it is always possible that one build ended up with a little more time to do something extra or different than another build did. Thus, the final result could be a different set of bits for what is the same functionality.
Thus, function and section order in the resulting PE are not guaranteed to be consistent from build to build. An example is at the link.

I suppose you could write a utility to open the PE, set the timestamp to whatever you like, recompute the checksum, then write it back out. It would be nice if there were an official way to ensure perfect binary reproducibility, though.
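A minimal sketch of such a utility, assuming Win32 and the ImageHlp library (error handling trimmed, and note that it only fixes the timestamp and checksum, not the PDB GUID mentioned above):

    #include <windows.h>
    #include <imagehlp.h>
    #pragma comment(lib, "imagehlp.lib")

    // Map the PE file writable, overwrite the link timestamp with a fixed
    // value, recompute the header checksum, and let the OS flush it back.
    bool PatchTimestamp(const char* path, DWORD fixedTimestamp)
    {
        HANDLE file = CreateFileA(path, GENERIC_READ | GENERIC_WRITE, 0, nullptr,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (file == INVALID_HANDLE_VALUE) return false;

        DWORD size = GetFileSize(file, nullptr);
        HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READWRITE, 0, 0, nullptr);
        BYTE* view = static_cast<BYTE*>(MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, 0));

        // Locate the NT headers via the DOS header and patch TimeDateStamp.
        auto* dos = reinterpret_cast<IMAGE_DOS_HEADER*>(view);
        auto* nt  = reinterpret_cast<IMAGE_NT_HEADERS*>(view + dos->e_lfanew);
        nt->FileHeader.TimeDateStamp = fixedTimestamp;

        // Recompute the PE checksum over the patched image.
        DWORD oldSum = 0, newSum = 0;
        if (CheckSumMappedFile(view, size, &oldSum, &newSum))
            nt->OptionalHeader.CheckSum = newSum;

        UnmapViewOfFile(view);
        CloseHandle(mapping);
        CloseHandle(file);
        return true;
    }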
For more information:
http://msdn.microsoft.com/en-us/magazine/cc301805.aspx

Related

Can a compiled C++ DLL differ in code without a size difference?

I get vendor parts of my software delivered as closed-source, compiled DLLs, and I normally compare the file size in bytes (from the file properties) with the previous version.
If I hash the files, the hashes will always differ due to a version header. Can you answer:
Can I conclude that the source inside the DLLs is the same when the size in bytes matches for both?
Or is there a better method to check whether something has changed? I'm talking about net modules that get updated regularly. Sometimes the size changes and I can say with certainty that a fix has been made to the source, hence the final compiled DLLs differ in size. But does a trivial commit to the source also translate into a size difference that Windows can show in the file properties of the compiled DLLs?
The goal is to know whether something has changed in newer versions, given that the vendor DLLs are closed-source and the vendor does not provide the source.
To clarify the question some more: if the files don't differ in size (bytes), is it possible that code has been added? (Replacing code with the exact same amount of code would produce the same final size.) Or, if the source contains more characters, will that definitely translate into a different size?
PE files have a specified file alignment, to which their sections are padded out with zeros, so a DLL's size on disk is always a multiple of that alignment (typically 512 bytes). This makes it much more likely that similar, but different, DLLs will be the same size on disk, because small code changes are often absorbed entirely by the padding.
It should also be obvious that simply changing an internal constant from a 4 to a 5 would not change the file size, but could have a profound impact on how the code runs.
Your best bet is to generate a hash of the files, such as MD5 or SHA-1, and compare the hashes.
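As one way to do that, here is a minimal sketch (assuming Win32 and the CNG bcrypt API; error handling trimmed) that computes a SHA-256 digest of a whole file:

    #include <windows.h>
    #include <bcrypt.h>
    #include <fstream>
    #include <iterator>
    #include <vector>
    #pragma comment(lib, "bcrypt.lib")

    // Read the whole file and hash it with SHA-256 via Windows CNG.
    std::vector<unsigned char> Sha256File(const char* path)
    {
        std::ifstream in(path, std::ios::binary);
        std::vector<unsigned char> data((std::istreambuf_iterator<char>(in)),
                                        std::istreambuf_iterator<char>());

        std::vector<unsigned char> digest(32);  // SHA-256 produces 32 bytes
        BCRYPT_ALG_HANDLE alg = nullptr;
        BCRYPT_HASH_HANDLE hash = nullptr;

        BCryptOpenAlgorithmProvider(&alg, BCRYPT_SHA256_ALGORITHM, nullptr, 0);
        BCryptCreateHash(alg, &hash, nullptr, 0, nullptr, 0, 0);
        BCryptHashData(hash, data.data(), static_cast<ULONG>(data.size()), 0);
        BCryptFinishHash(hash, digest.data(), static_cast<ULONG>(digest.size()), 0);

        BCryptDestroyHash(hash);
        BCryptCloseAlgorithmProvider(alg, 0);
        return digest;
    }

Note that, as you observed, a whole-file hash will also flag the version-header change; ignoring that would require comparing at the section level instead, as described in the next answer.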
In addition to what Lightness said, there are some tools that will try to compare executable files at the section level, with the goal of determining that two files contain the same code despite differences in metadata. However, they're definitely considered reverse-engineering tools and might be difficult to use. To get very far down this road, you're going to need a firm understanding of the PE file structure, and probably some x86 assembly as well.
It'd be far easier to ask the upstream vendor. Even closed-source software can have thorough changelogs.
Comparing binaries by file size alone is ridiculous. You can trivially produce two entirely different programs that have the same file size. You should compare files byte-wise, not size-wise.
That said, the only reliable way to tell whether a program's semantics have been changed — be it through a bug fix or new feature — is to read the product's changelog, or examine its source code.

How to measure Code Size?

When certain features or optimizations are discussed, Code Size is often mentioned.
While I certainly understand the basic concept (a collection of code, compiled to machine code, results in X bytes of machine code plus static data), I have recently realized that I'm very unsure how to actually measure the code size of a given binary.
So, how do you measure Code Size?
Do you just check how big the resulting binary ("executable", .exe) is?
Do you need a tool such as dumpbin.exe or some specific linker flags to get detailed results?
You can tell the linker to produce a map file. This gives about the most detailed information that's easy to get (i.e., much short of reverse engineering the code by hand).
Depending on the code, using dumpbin on an object file can produce meaningful results, but can also produce simply "anonymous object" -- especially (exclusively?) when you ask for link-time code generation.
I'd say your best bet is to disassemble the binary.
In the context of code optimizations, total code size isn't typically what is meant, but rather code size for some specific part of your program.
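As a programmatic alternative to dumpbin, here is a hedged sketch (assuming Win32 and an image whose bitness matches the build; error handling trimmed) that reads the PE headers and prints SizeOfCode, the linker's total for all code sections, plus each section's raw size:

    #include <windows.h>
    #include <cstdio>

    // Map the binary read-only and report SizeOfCode plus a per-section
    // breakdown (.text is usually the code).
    void PrintCodeSize(const char* path)
    {
        HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
        const BYTE* base = static_cast<const BYTE*>(
            MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));

        const auto* dos = reinterpret_cast<const IMAGE_DOS_HEADER*>(base);
        const auto* nt  = reinterpret_cast<const IMAGE_NT_HEADERS*>(base + dos->e_lfanew);

        printf("SizeOfCode: %lu bytes\n", nt->OptionalHeader.SizeOfCode);

        const IMAGE_SECTION_HEADER* sec = IMAGE_FIRST_SECTION(nt);
        for (WORD i = 0; i < nt->FileHeader.NumberOfSections; ++i, ++sec)
            printf("%-8.8s %lu bytes\n",
                   reinterpret_cast<const char*>(sec->Name), sec->SizeOfRawData);

        UnmapViewOfFile(base);
        CloseHandle(mapping);
        CloseHandle(file);
    }

The map file and dumpbin /headers give you the same numbers with more context.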
If you literally mean the size of the .exe in bytes, I think you're over-thinking the question. Your file explorer should show the size of each file (if it doesn't, right-click the file and open Properties). The file you're looking for should be in the Debug folder, with a .exe extension.
If you meant something else, sorry.

How to figure out which methods increase the size of the 'exe'

I'm trying to write my first 'demoscene' application in MS Visual Studio Express 2010. Suddenly I realized that my binary expanded from 16 KB to ~100 KB in the fully-optimized-for-size release version. My target size is 64 KB. Is there any way to somehow "browse" the binary to figure out which methods consume a lot of space, and which I should rewrite? I really want to know what my binary consists of.
From what I found on the web, VS2010 is not the best compiler for demoscene work, but I still want to understand what's happening inside my .exe file.
I think you should have MSVC generate a map file for you. This is a file that will tell you the addresses of most of the different functions in your executable. The difference between consecutive addresses should tell you how much space the function takes. To generate a map file, add the /MAP linker option. For more info, see:
http://msdn.microsoft.com/en-us/library/k7xkk3e2(v=VS.100).aspx
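Once you have the map file, per-function size is roughly the difference between consecutive addresses, as described above. A hedged sketch of that arithmetic (it assumes the usual "Publics by Value" layout, that segment 0001 is the code segment, and that only public symbols are listed, so the numbers are rough upper bounds):

    #include <cstdio>
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>

    struct Symbol { unsigned long long rva; std::string name; };

    // Parse "0001:offset  name  rva ..." lines from a /MAP file and print the
    // gap to the next symbol as an estimate of each function's size.
    void PrintFunctionSizes(const char* mapPath)
    {
        std::ifstream in(mapPath);
        std::vector<Symbol> syms;
        std::string line;
        while (std::getline(in, line)) {
            std::istringstream ls(line);
            std::string addr, name, rvaHex;
            if (!(ls >> addr >> name >> rvaHex)) continue;
            if (addr.compare(0, 5, "0001:") != 0) continue;        // code segment only
            if (rvaHex.find_first_not_of("0123456789abcdefABCDEF")
                    != std::string::npos) continue;                // skip non-symbol rows
            Symbol s;
            s.rva = std::stoull(rvaHex, nullptr, 16);
            s.name = name;
            syms.push_back(s);
        }
        for (size_t i = 0; i + 1 < syms.size(); ++i)
            printf("%-40s ~%llu bytes\n", syms[i].name.c_str(),
                   syms[i + 1].rva - syms[i].rva);
    }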
You can strip off lots of unnecessary stuff from the executable and compress it with utilities such as mew.
I've found this useful for examining executable sizes (although not for demoscene type things): http://aras-p.info/projSizer.html
I will say this: if you are using the standard library at all, then stop immediately. It is a huge code bloater. For example, each unique usage of std::sort adds around 5 KB, and there are similar numbers for many of the standard containers (of course, it depends on which functions you use, but in general they add lots of code).
Also, I'm not into the demo scene, but I believe people use Crinkler to compress their executables.
Use your version control system to see what caused the increase. Going forward, I'd log the built exe size during the nightly builds. And don't forget you can optimize for minimal size with the compiler settings.

exe checksum different after each recompile

So I'm trying to figure out how to get my exe to have the same hash code/checksum when it's recompiled. I'm using FastSum to generate the checksum. Currently, no code changes are made; I'm just rebuilding the project in VS and the checksum comes out different. The code is written in C++.
I'm not familiar with using hash codes and/or checksums in this manner, but I did some research and read something about needing a consistent GUID. But I have no idea how that would tie into the checksum generation program...
Well, I'll leave it at that, thanks in advance.
Have you examined the differences between the exes? I suspect the compiler/linker is inserting the date or time into the binary, so each binary differs from the last. It can be worse: compilers/linkers sometimes build static tables in their own memory and then copy them into the binary. Say you have 9 bytes of something and, for alignment reasons, the compiler chooses to use 12 bytes in the binary; I have seen tools copy whatever 3 bytes happened to be sitting in memory into the file. Ideally the tools would zero out the memory they use for such things, so you get repeatable results.
Basically, do a binary diff between the files; you should then find out why they don't match.
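A minimal byte-wise diff is easy to write yourself (a sketch; error handling trimmed). The offsets it prints can then be mapped back to PE fields such as the TimeDateStamp in the COFF header:

    #include <cstdio>
    #include <fstream>
    #include <iterator>
    #include <vector>

    // Print the offset and values of every byte that differs between two files.
    void DiffFiles(const char* pathA, const char* pathB)
    {
        std::ifstream fa(pathA, std::ios::binary), fb(pathB, std::ios::binary);
        std::vector<unsigned char> a((std::istreambuf_iterator<char>(fa)),
                                     std::istreambuf_iterator<char>());
        std::vector<unsigned char> b((std::istreambuf_iterator<char>(fb)),
                                     std::istreambuf_iterator<char>());

        if (a.size() != b.size())
            printf("sizes differ: %llu vs %llu bytes\n",
                   (unsigned long long)a.size(), (unsigned long long)b.size());

        size_t n = a.size() < b.size() ? a.size() : b.size();
        for (size_t i = 0; i < n; ++i)
            if (a[i] != b[i])
                printf("offset 0x%llx: %02x != %02x\n",
                       (unsigned long long)i, (unsigned)a[i], (unsigned)b[i]);
    }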
From what I recall, the EXE format includes a build timestamp so a hash of the exe, including that timestamp, would change on each recompile.
Is this a managed binary? Managed binaries have a GUID section that changes from build to build and there's not much you can do to stop that.
You can get a better look at the changes in your binary by running "link /dump /all [filename]" or "link /dump /disasm [filename]". The /all option shows all the hex values as well as their ASCII equivalents, while the /disasm option disassembles the code and shows it to you in assembly, which can be easier to read but might ignore some trivial differences that could have caused the hash to change.

Open-source C++ scanning library

Rationale: In my day-to-day C++ code development, I frequently need to
answer basic questions such as who calls what in a very large C++ code
base that is frequently changing. But, I also need to have some
automated way to exactly identify what the code is doing around a
particular area of code. "grep" tools such as Cscope are useful (and
I use them heavily already), but are not C++-language-aware: They
don't give any way to identify the types and kinds of lexical
environment of a given use of a type or function in such a way that is
conducive to automation (even if said automation is limited to
"read-only" operations such as code browsing and navigation, but I'm
asking for much more than that below).
Question: Does there exist already an open-source C/C++-based library
(native, not managed, not Microsoft- or Linux-specific) that can
statically scan or analyze a large tree of C++ code, and can produce
result sets that answer detailed questions such as:
What functions are called by some supplied function?
What functions make use of this supplied type?
Ditto the above questions if C++ classes or class templates are involved.
The result set should provide some sort of "handle". I should be able
to feed that handle back to the library to perform the following types
of introspection:
What is the byte offset into the file where the reference was made?
What is the reference into the abstract syntax tree (AST) of that
reference, so that I can inspect surrounding code constructs? And
each AST entity would also have file path, byte-offset, and
type-info data associated with it, so that I could recursively walk
up the graph of callers or referrers to do useful operations.
The answer should meet the following requirements:
API: The API exposed must be one of the following:
C or C++ and probably is "C handle" or C++-class-instance-based
(and if it is, must be generic C or C++ code and not Microsoft- or
Linux-specific code constructs unless it is to meet specifics of
the given platform), or
Command-line standard input and standard output based.
C++ aware: Is not limited to C code, but understands C++ language
constructs in minute detail including awareness of inter-class
inheritance relationships and C++ templates.
Fast: Should scan large code bases significantly faster than
compiling the entire code base from scratch. This probably needs to
be relaxed, but only if the "Incremental result retrieval" and
"Resilient to small code changes" requirements below are fully met.
Provide Result counts: I should be able to ask "How many results
would you return for some request?" (without being sent all of the
results) and get an answer in under 3 seconds, versus having to
retrieve all results for any given question. If it takes too long
to get that answer, development time is wasted. This is coupled
with the next requirement.
Incremental result retrieval: I should be able to then ask "Give me
just the next N results of this request", and then a handle to the
result set so that I can ask the question repeatedly, thus
incrementally pulling out the results in stages. This means I
should not have to wait for the entire result set before seeing
some subset of all of the results. And that I can cancel the
operation safely if I have seen enough results. Reason: I need to
answer the question: "What is the build or development impact of
changing some particular function signature?"
Resilient to small code changes: If I change a header or source
file, I should not have to wait for the entire code base to be
rescanned, but only that header or source file
rescanned. Rescanning should be quick. E.g., don't do what cscope
requires you to do, which is to rescan the entire code base for
small changes. It is understood that if you change a header, then
scanning can take longer since other files that include that header
would have to be rescanned.
IDE Agnostic: Is text editor agnostic (don't make me use a specific
text editor; I've made my choice already, thank you!)
Platform Agnostic: Is platform-agnostic (don't make me only use it
on Linux or only on Windows, as I have to use both of those
platforms in my daily grind, but I need the tool to be useful on
both as I have code sandboxes on both platforms).
Non-binary: Should not cost me anything other than time to download
and compile the library and all of its dependencies.
Not trial-ware.
Actively Supported: Sending help requests to mailing lists or
associated forums is likely to get a response in less than 2
days.
Network agnostic: Databases the library builds should be able to be used directly on
a network from 32-bit and 64-bit systems, both Linux and Windows
interchangeably, at the same time, and do not embed hardcoded paths
to filesystems that would otherwise "root" the database to a
particular network.
Build environment agnostic: Does not require intimate knowledge of my build environment, with
the notable exception of possibly requiring knowledge of compiler
supplied CPP macro definitions (e.g. -Dmacro=value).
I would say that Clang Index is a close fit. However, I don't think it stores data in a database.
In any case, the Clang framework offers what you actually need to build a tool tailored to your needs, if only because of its C, C++, and Objective-C parsing / indexing capabilities. And since it's provided as a set of reusable libraries... it was crafted to be developed on!
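As a taste of what that looks like in practice, here is a hedged sketch against the libclang C API (the stable C interface on top of the Clang indexing work); the source file name and compiler flags are placeholders, and error handling is trimmed. It parses one translation unit and prints every function call it finds:

    #include <clang-c/Index.h>
    #include <cstdio>

    // Visitor: report every call expression with its spelling and location.
    static enum CXChildVisitResult visit(CXCursor cursor, CXCursor, CXClientData)
    {
        if (clang_getCursorKind(cursor) == CXCursor_CallExpr) {
            CXString callee = clang_getCursorSpelling(cursor);
            unsigned line = 0, column = 0;
            clang_getSpellingLocation(clang_getCursorLocation(cursor),
                                      nullptr, &line, &column, nullptr);
            printf("call to %s at %u:%u\n", clang_getCString(callee), line, column);
            clang_disposeString(callee);
        }
        return CXChildVisit_Recurse;  // keep walking the whole AST
    }

    int main()
    {
        const char* args[] = { "-std=c++17", "-Iinclude" };  // hypothetical flags
        CXIndex index = clang_createIndex(0, 1);
        CXTranslationUnit tu = clang_parseTranslationUnit(
            index, "example.cpp", args, 2, nullptr, 0, CXTranslationUnit_None);

        if (tu) {
            clang_visitChildren(clang_getTranslationUnitCursor(tu), visit, nullptr);
            clang_disposeTranslationUnit(tu);
        }
        clang_disposeIndex(index);
    }

Building the cross-reference database, the incremental rescanning, and the result paging would still be up to you, but the parsing and AST access the question asks for are all there.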
I have to admit that I haven't used either, because I work with a lot of Microsoft-specific code that uses Microsoft compiler extensions that I don't expect them to understand, but the two open-source analyzers I'm aware of are Mozilla Pork and the Clang Analyzer.
If you are looking for the results of code analysis (metrics, graphs, ...), why not use a tool (instead of an API) to do that? If you can, I suggest you take a look at Understand.
It's not free (there's a trial version), but I found it very useful.
Maybe Doxygen with GraphViz could be the answer to some of your constraints, but not all; for example, Doxygen's analysis is not incremental.