I ran a code formatting tool over my C++ files. It is supposed to make only formatting changes. Now when I build my code, I see that the size of the object file for some source files has changed. Since my files are very big and the tool has changed almost every line, I don't know whether it has done something disastrous. I am now worried about checking this code into the repo, as the formatting tool might have introduced a runtime error. My question is: will the size of an object file change if only the code formatting is changed?
The brief answer is no. :)
I would not check your code into the repo without thoroughly checking it first (review, testing).
Pure formatting changes should not change the object file size, unless you've done a debug build (in which case all bets are off). A release build should be not just the same size but, barring your using __DATE__ and such to insert preprocessor content, byte-for-byte identical as well.
If the "reformatting" tool has actually done some micro-optimizations for you (caching repeated access to invariants in local vars, or undoing your having done that unnecessarily), that might affect the optimization choices the compiler makes, which can have an effect on the object file. But I wouldn't assume that that was the case.
If the __LINE__ macro is used, the reformatted code might produce longer strings, since reformatting shifts line numbers. How different are the sizes?
(This macro often hides in new and assert messages in debug builds.)
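A minimal sketch of how that happens, assuming a location macro that stringizes __LINE__ (which is how many assert and logging macros embed positions; __DATE__ behaves similarly for release builds):

    #include <cstdio>

    // Stringize __LINE__ the usual two-step way.
    #define STRINGIZE_DETAIL(x) #x
    #define STRINGIZE(x) STRINGIZE_DETAIL(x)
    // Embeds "file:line" as a string literal in the object file. If the
    // formatter moves this use from line 99 to line 100, the literal grows
    // by one byte -- hence a (slightly) different object file size.
    #define WHERE __FILE__ ":" STRINGIZE(__LINE__)

    int main() {
        std::puts("logged at " WHERE);
        return 0;
    }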
Just formatting the code should not change the size of the object file.
It might if you compile with debugging symbols, as it might have added more line number information. Normally it wouldn't though, as has already been pointed out.
Try comparing object files built without debugging symbols.
Try to find a comparison tool that won't care about the formatting changes (perhaps "diff --ignore-all-space") and check using that before checking in.
Every so often I (re)compile some C (or C++) file I am working on -- which by the way succeeds without any warnings -- and then I execute my program only to realize that nothing has changed since my previous compilation. To keep things simple, let's assume that I added an instruction to my source to print out some debugging information onto the screen, so that I have a visual evidence of trouble: indeed, I compile, execute, and unexpectedly nothing is printed onto the screen.
This happened to me once when I had buggy code (I ran out of the bounds of a static array). Of course, if your code has some kind of hidden bug (see "What are all the common undefined behaviours that a C++ programmer should know about?"), the compiled code can be pretty much anything.
This happened to me twice when I used a ridiculously slow network hard drive which, I guess, simply did not update my executable file after compilation, and I kept running the old version despite the updated source. I am just speculating here (feel free to correct me if such a phenomenon is impossible), but I suspect it had something to do with certain processes waiting on IO.
Well, such things could of course happen (and they indeed do), when you execute an old version in the wrong directory (that is: you execute something similar, but actually completely unrelated to your source).
It is happening again, and it annoys me enough to ask: how do you make sure that your executable matches the source you are working on? Should I compare the date strings of the source and the executable in the main function? Should I delete the executable prior to compilation? I guess people might do something similar by means of version control.
Note: I was warned that this might be a subjective topic likely doomed to be closed.
Just use good ol' version control.
In the easy case, you can simply embed a visible version ID in the code (a hash, revision ID, or timestamp) and check it at runtime.
If your project has a lot of dependent files and you suspect that the produced code is older than the "latest" version, you can (besides, obviously, good makefile rules) also track the version of every file used to build the code (a VCS-dependent, but not very heavy, trick).
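A minimal sketch of the version-ID idea; BUILD_ID here is a hypothetical macro you would define from your build script (e.g. from the VCS revision), with __DATE__/__TIME__ as a fallback:

    #include <cstdio>

    // Hypothetical: pass -DBUILD_ID="\"r1234\"" from the build system,
    // e.g. generated from the VCS revision at build time.
    #ifndef BUILD_ID
    #define BUILD_ID "unversioned"
    #endif

    int main() {
        // Printed at startup, so a stale executable is immediately obvious.
        std::printf("build %s, compiled %s %s\n", BUILD_ID, __DATE__, __TIME__);
        // ... rest of the program ...
        return 0;
    }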
Check the timestamp of your executable. That should give you a hint regarding whether or not it is recent/up-to-date.
Alternatively, calculate a checksum for your executable and display it on startup; then, if the checksum is the same, you have a clue that the executable was not updated.
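A rough sketch of that, assuming argv[0] actually points at the executable file (not guaranteed on every platform; on Linux you could open /proc/self/exe instead):

    #include <cstdint>
    #include <cstdio>
    #include <fstream>
    #include <istream>

    // FNV-1a: a simple non-cryptographic hash, good enough to spot a stale build.
    static std::uint64_t fnv1a(std::istream& in) {
        std::uint64_t h = 1469598103934665603ull;
        char c;
        while (in.get(c)) {
            h ^= static_cast<unsigned char>(c);
            h *= 1099511628211ull;
        }
        return h;
    }

    int main(int argc, char** argv) {
        std::ifstream self(argv[0], std::ios::binary);
        if (self)
            std::printf("executable checksum: %016llx\n",
                        static_cast<unsigned long long>(fnv1a(self)));
        // ... rest of the program ...
        return 0;
    }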
When certain features or optimizations are discussed, Code Size is often mentioned.
While I certainly understand the basic concept, that is, that a collection of code, compiled to machine code, will result in X bytes of machine code (plus static data), I have recently realized that I'm very unsure how to actually measure the Code Size of a given binary.
So, how do you measure Code Size?
Do you just check how big the resulting binary ("executable", .exe) is?
Do you need a tool such as dumpbin.exe or some specific linker flags to get detailed results?
You can tell the linker to produce a map file (for example, /MAP with the Microsoft linker, or -Wl,-Map=output.map when driving the GNU linker through gcc). This gives about the most detailed information that's easy to get (i.e., short of reverse engineering the code by hand).
Depending on the code, using dumpbin on an object file can produce meaningful results, but can also produce simply "anonymous object" -- especially (exclusively?) when you ask for link-time code generation.
I'd say your best bet is to disassemble the binary.
In the context of code optimizations, total code size isn't typically what is meant, but rather code size for some specific part of your program.
If you literally mean the size of the .exe in bytes, I think you're over-thinking the question. Your file explorer should show the size of each file (if it doesn't, right-click the file and open Properties). The file you're looking for should be in the Debug folder, with a .exe extension.
If it's something else, sorry.
I have a couple of simple C++ homework assignments and I know the students shared code. These are smart students and they know how to cheat moss. I'm looking for a tool that can rename variables based on their types (the first variable of type int will be int1, the first int array will be intptr1, ...), or that does something similar that I cannot think of now. Do you know a quick way to do this?
edit: I'm required to use moss and report 90% match
Thanks
Yep, the tool you're looking for is called a compiler. :)
Seriously, if the programs submitted are exactly the same except for the identifier names, compiling them (without debugging info) should result in exactly the same output.
If you do this with debugging turned on, the compiler may leave meta-data in the executable that is different for each executable, hence the comment about ensuring it is off. This is also why this won't work for Java programs - that kind of info is present whether in debug mode or not (for the purposes of dynamic introspection).
EDIT: I see from the comments added to the question that you're observing some submissions that are different in more than just identifier names. If the programs are still structurally equivalent, this should still work.
EDIT: Given that the use of moss is a requirement, this probably isn't the way to go. It does seem, though, that moss has some support for comparing assembly - perhaps compiling to assembler and submitting that to moss is an option (depending on what compiler you're using).
You can download and try our C CloneDR duplicate code detector. It finds duplicated code even when the variable names have been changed. Multiple changes in the same chunk are treated as just one; if they rename the variables consistently everywhere, you'll get back a report of "one clone" with the precise variable substitution.
You can try Copy Paste Detector with ignoreIdentifiers turned on. You can at least use it for a first pass before going to the effort of normalizing names for moss. Or, since the source is available, maybe you can get it to spit out its internal normalization of the code.
Another way of doing this would be to compile the applications and compare their binaries, so your examination is not limited to variable/function name changes.
A hex editor can help you with that. I just tried ExamDiff (not free $) and I was happy with the result.
Let's say that I have a binary that I am building, and I include a bunch of header files that are never actually used, and then do the subsequent linking against the libraries described by those include files (again, these libraries are never used).
What are the negative consequences of this, beyond increased compile time?
A few I can think of are namespace pollution and binary size
In addition to compile time: increased complexity, needless distraction while debugging, and a maintenance overhead.
Apart from that, nothing.
In addition to what Sasha lists, there is maintenance cost. Will you be able to easily detect what is used and what is not used in the future, when and if you choose to remove unused stuff?
If the libraries are never used, there should be no increase in size for the executable.
Depending on the exact linker, you might also notice that the global objects of your unused libraries still get constructed. This implies a memory overhead and increases startup costs.
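A minimal sketch of that effect, assuming the library's object file actually makes it into the final link (which depends on the linker and on static vs. dynamic linking):

    // unused_lib.cpp -- imagine this lives in a library you link but never call.
    #include <cstdio>

    namespace {
    // A global with a non-trivial constructor: if the linker keeps this
    // translation unit, the constructor runs before main() even though no
    // function from this library is ever referenced.
    struct StartupCost {
        StartupCost() { std::puts("unused library still initializing..."); }
    };
    StartupCost global_instance;
    }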
If the libraries you're including but not using aren't on the target system, it won't be able to compile even though they aren't necessary.
Here is my answer for a similar question concerning C and static libraries. Perhaps it is useful to you in the context of C++ as well.
You mention an increase in the compilation time. From that I understand the libraries are statically linked, not dynamically. In this case, it depends how the linker handles unused functions. If it ignores them, you will have mostly maintenance problems. If they are included, the executable size will increase. Now, this is more significant than the space it takes on the hard drive. Large executables can run slower due to caching issues: if active code and non-active code are adjacent in the exe, they will be cached together, making the cache effectively smaller and less efficient.
VC2005 and above have an optimization called PGO (profile-guided optimization), which orders the code within the executable in a way that ensures effective caching of code that is often used. I don't know if g++ has a similar optimization, but it's worth looking into.
A little compilation here of the issues, wiki-edit it as necessary:
The main problem appears to be: Namespace Pollution
This can cause problems in future debugging, version control, and increased future maintenance cost.
There will also be, at the minimum, minor Binary Bloat, as the function/class/namespace references will be maintained (in the symbol table?). Dynamic libraries should not greatly increase binary size (but they become a dependency for the binary to run?). Judging from the GNU C compiler, statically linked libraries should not be included in the final binary if they are never referenced in the source. (Assumption based on the C compiler; may need to clarify/correct.)
Also, depending on the nature of your libraries, global and static objects/variables may be instantiated, causing increased startup time and memory overhead.
Oh, and increased compile/linking time.
I find it frustrating when I edit a file in the source tree because some symbol that I'm working on appears in the source file (e.g. a function name where I've just changed the prototype - or, sadly but more typically, just added the prototype to a header), so I need to check that the use is correct, or the compiler now tells me the use in that file is incorrect. So, I edit the file. Then I see a problem - what is this file doing? And it turns out that although the code is 'used' in the product, it really isn't actively used at all.
I found an occurrence of this problem on Monday. A file with 10,000+ lines of code invoked a function 'extern void add_remainder(void);' with an argument of 0. So, I went to fix it. Then I looked at the rest of the code... it turned out it was a development stub from about 15 years ago that had never been removed. Cleanly excising the code turned out to involve minor edits to more than half a dozen files, and I've not yet worked out whether it is safe to remove the enumeration constant from the middle of an enumeration. For the time being, it is marked 'Unused/Obsolete - can it be removed safely?'.
That chunk of code has had zero code coverage for the last 15 years - production, test, ... True, it's only a tiny part of a vast system - percentage-wise, it's less than a 1% blip on the chart. Still, it is extra wasted code.
Puzzling. Annoying. Depressingly common (I've logged, and fixed, at least half a dozen similar bugs this year so far).
And a waste of my time - and other developers' time. The file had been edited periodically over the years by other people doing what I was doing - a thorough job.
I have never experienced any problems with linking a .lib file of which only a very small part is used. Only the code that is really used will be linked into the executable, and the linking time did not increase noticeably (with Visual Studio).
If you link to binaries and they get loaded at runtime, they may perform non-trivial initialization which can do anything from allocate a small amount of memory to consume scarce resources to alter the state of your module in ways you don't expect, and beyond.
You're better off getting rid of stuff you don't need, simply to eliminate a bunch of unknowns.
It could perhaps even fail to compile if the build tree isn't well maintained. If you're compiling on an embedded system without swap space, the compiler can run out of memory while trying to compile a massive object file.
This happened to us at work recently.
I am looking for a tool to simplify analysing a linker map file for a large C++ project (VC6).
During maintenance, the binaries grow steadily and I want to figure out where the growth comes from. I suspect some overzealous template expansion in a library shared between different DLLs, but just browsing the map file doesn't give good clues.
Any suggestions?
This is a wonderful analysis/explorer/viewer tool for compiler-generated map files. Check whether it can explore gcc-generated map files.
amap: a tool to analyze .MAP files produced by the 32-bit Visual Studio compiler and report the amount of memory being used by data and code.
This app can also read and analyze MAP files produced by the Xbox360, Wii, and PS3 compilers.
The map file should have the size of each section, you can write a quick tool to sort symbols by this size. There's also a command line tool that comes with MSVC (undname.exe) which you can use to demangle the symbols.
Once you have the symbols sorted by size, you can generate this weekly or daily as you like and compare how the size of each symbol has changed over time.
The map file alone from any single build may not tell much, but a historical report of compiled map files can tell you quite a bit.
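A rough sketch of such a quick tool, assuming a simplified MSVC map layout (lines like " 0001:00000000  ?foo@@YAXXZ  00401000 f lib.obj"); since the map does not list symbol sizes directly, each size is estimated as the gap to the next symbol's address:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    struct Symbol {
        std::uint64_t address = 0;
        std::string name;
    };

    int main(int argc, char** argv) {
        if (argc < 2) { std::cerr << "usage: mapsize <file.map>\n"; return 1; }

        std::ifstream in(argv[1]);
        std::vector<Symbol> symbols;
        std::string line;
        while (std::getline(in, line)) {
            // Expected shape: " 0001:00000000  ?foo@@YAXXZ  00401000 f lib.obj"
            std::istringstream fields(line);
            std::string section_offset, name, address;
            if (!(fields >> section_offset >> name >> address)) continue;
            if (section_offset.find(':') == std::string::npos) continue;
            try {
                symbols.push_back({std::stoull(address, nullptr, 16), name});
            } catch (...) { /* not a symbol line after all; skip it */ }
        }

        // The map lists addresses, not sizes: estimate each symbol's size
        // as the distance to the next symbol's address.
        std::sort(symbols.begin(), symbols.end(),
                  [](const Symbol& a, const Symbol& b) { return a.address < b.address; });
        std::vector<std::pair<std::uint64_t, std::string>> sized;
        for (std::size_t i = 0; i + 1 < symbols.size(); ++i)
            sized.emplace_back(symbols[i + 1].address - symbols[i].address,
                               symbols[i].name);

        // Biggest first: the usual suspects for binary growth float to the top.
        std::sort(sized.rbegin(), sized.rend());
        for (const auto& entry : sized)
            std::cout << entry.first << '\t' << entry.second << '\n';
        return 0;
    }

Run it over each build's map file and diff the outputs (after demangling with undname.exe, if you like) to see which symbols are growing over time.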
Have you tried using dumpbin.exe on your .obj files?
Stuff to look for:
Using a lot of STL?
A lot of C++ classes with inline methods?
A lot of constants?
If any of the above applies to you, check whether they have wide visibility, i.e. whether they are used/seen in large parts of your application.
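A minimal illustration of the template point (a hypothetical example): each distinct instantiation stamps a separate copy of the machine code into the object file, which adds up quickly for heavily templated code used across many translation units:

    #include <vector>

    // Each distinct T generates its own compiled body of this function.
    template <typename T>
    T sum(const std::vector<T>& values) {
        T total{};
        for (const auto& v : values) total += v;
        return total;
    }

    // Three instantiations -> three copies of the code in the object file.
    template int    sum<int>(const std::vector<int>&);
    template float  sum<float>(const std::vector<float>&);
    template double sum<double>(const std::vector<double>&);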
No suggestion for a tool, but a guess as to a possible cause: do you have incremental linking enabled? This can cause expansion during subsequent builds...
The linker will strip unused symbols if you're compiling with /opt:ref, so if you're using that and not using incremental linking, I would expect expansion of the binaries to be only a result of actual new code being added. That's as far as I know... hope it helps a little.
Templates, macros, and the STL in general use a tremendous amount of space. Heralded as a great universal library, Boost adds much space to projects. BOOST_FOREACH is an example of this: it's hundreds of lines of templated code that could simply be avoided by writing a proper loop by hand, which is in general only a few more keystrokes.
Get Visual AssistX to save typing instead of using templates. Also consider owning the code you use. Macros and inline function expansion are not necessarily going to show up.
Also, if you can, move away from DLL architecture to statically linking everything into one executable which runs in different "modes". There is absolutely nothing wrong with using the same executable image as many times as you want just passing in a different command line parameter depending on what you want it to do.
DLLs are the worst culprit for wasting space and slowing down the running time of a project. People think they are space savers, when in fact they tend to have the opposite effect, sometimes increasing project size by ten times! Plus, they increase swapping. Use fixed code sections (no relocation section) for performance.