I have a Fortran project comprising around 100 files. All of them are compiled, but not all of them are needed for a run. Also, some files contain a mix of used and unused subroutines.
Is there a way to know the minimum set of files / subroutines / functions needed for a run?
Yes.
Start with just the top-level main program.
Compile/Link it and get the list of unsatisfied references.
For each of the unsatisfied references, add the file containing it.
Repeat these two steps until there are no more unsatisfied references.
This identifies every file actually needed for a clean build; everything else can be dropped.
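A minimal sketch of one round of that loop, assuming gfortran and hypothetical file names:
gfortran -o prog mainfile.f90
The link fails with something like "undefined reference to `somesub_'"; search for the file that defines the missing routine and add it:
grep -il somesub *.f90
gfortran -o prog mainfile.f90 filea.f90
Repeat until the link succeeds; the files on that final command line are your minimum set for the build.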
Yes, it is a lot of work. Things are not always easy.
If you want to know what's not needed for a run with particular input, you need a coverage tool such as gcov.
Routines that are not used will show a coverage count of 0.
Then if you really want to get rid of them,
extract them into separate source files so you can exclude them, and make sure they are not called.
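A minimal sketch of that with gfortran and gcov, assuming a representative input file named input.dat:
gfortran -O0 --coverage -c *.f90
gfortran --coverage -o prog *.o
./prog < input.dat
gcov *.f90
Each .gcov report marks lines that never executed with #####; a routine whose whole body is marked that way was not used for that run.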
I'm learning C++, coming from GML, which is pretty similar in syntax. But what I don't understand is why it's recommended to break up a project into multiple files and multiple types of files.
From what I understand, cpp files are the actual program code, so people use multiple files to break up the functions and parts of the program into separate files. Header files hold the same kind of code as cpp files, but they're used by cpp files, so it's repeated code that can be referenced in multiple places. Is this correct?
To me it doesn't make as much sense because you're not going to be swapping out files for different builds; you still have to recompile, and all the files get merged into a binary (or a few if there are dlls).
So to me, wouldn't it make more sense to just have one large file? Instead of having header files that are referenced from many cpp files, just put the header files' code at the top of a single cpp file and create sections down the file, using regions for the content of what would be many cpp files.
I've seen a few tutorials that make small games like snake using a single file, and they just create sections as they move down: first initializing variables, then another section with all the functions and logic, then a renderer, then at the bottom the main function. Couldn't this be scaled up for a large project? Is it just for organization? Because, while I'm still learning, I feel it's more confusing to search through many files trying to track what files reference which other files and where things are happening, versus having it all in one file where you could just scroll down and any references would be to code defined above.
Is this wrong or just not common? Are there drawbacks?
If someone could shed some insight I'd appreciate it.
You can use one file if you like. Multiple files have benefits.
Yes, you do swap different files in and out for different builds. Or, at least, some people do. This is a common technique for making projects that run on multiple platforms.
It is hard to navigate a very large file. Projects can have 100,000 lines of code, or 2,000,000 lines of code, and if you put everything in one file, it may be extremely difficult to edit. Your text editor may slow to a crawl, or it may even be unable to load the entire file into memory.
It is faster to build a project incrementally. C++ suffers from relatively long build times, on average, for typical projects. If you have multiple files, you can build incrementally. This is often faster, since you only have to recompile the files that changed and their dependencies.
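As a rough sketch of that incremental workflow, with two hypothetical source files:
g++ -c game.cpp
g++ -c render.cpp
g++ -o game game.o render.o
After editing only render.cpp, you recompile just that one file and relink, instead of rebuilding everything.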
If your project is extremely large, and you put everything in one file, it’s possible that the compiler will run out of memory compiling it.
You can make unnamed namespaces and static variables / static functions, which cannot be called from other files. This lets you write code which is private to one file, and prevents you from accidentally calling it or accessing the variables from other files.
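A small sketch of that (the file name and functions are hypothetical): anything inside the unnamed namespace below can only be referenced from within this one .cpp file.
// physics.cpp
namespace {                                   // unnamed namespace: internal linkage
    double gravity = 9.81;                    // invisible to other .cpp files
    double square(double x) { return x * x; } // file-private helper
}

double fallDistance(double t) {               // the only symbol this file exposes
    return 0.5 * gravity * square(t);
}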
Multiple team members can each work on different files, and you will not get merge conflicts when you push your changes to a shared repository. If two people work on the same file, they are more likely to get merge conflicts.
I feel it's more confusing to search through many files trying to track what files reference which other files and where things are happening.
If you have a good editor or IDE, you can just click on a function call or press F12 (or some other key) and it will jump to the definition of that function (or type).
A little background: I'm trying to build an AVR binary for an embedded sensor system, and I'm running close to my size limit. I use a few external libraries to help me, but they are rather large when compiled into one object per library. I want to pull these into smaller objects so only the functionality I need is linked into my program. I've already managed to drop the binary size by 2k by splitting up a large library.
It would help a lot to know which objects are being used at each stage of the game so I can split them more efficiently. Is there a way to make ld print which objects it's linking?
I'm not sure about the object level, but I believe you might be able to tackle this at the symbol level using CFLAGS="-fdata-sections -ffunction-sections" and LDFLAGS="-Wl,--gc-sections -Wl,--print-gc-sections". This should get rid of the code for all unreferenced symbols, and display the removed symbols to you as well, which might be useful if, for some reason, you decide to go back to the object file level and want to identify object files containing only removed symbols.
To be more precise, the compiler flags I quoted ask the compiler to place each function or global variable in a section of its own, and the --gc-sections linker flag then removes all the sections which are not referenced. It might be that each object file already contributes sections of its own even without the compiler flags, with all the functions in a file sharing a single section; in that case the linker flag alone should do what you ask for: eliminate whole objects which are not used. The gcc manual states that the compiler flags will increase the object size, and although I hope that the final executable should not be affected by this, I don't know for sure, so you should give LDFLAGS="-Wl,--gc-sections" by itself a try in any case.
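A minimal sketch of the whole combination with an AVR toolchain, assuming a hypothetical source file sensor.c:
avr-gcc -Os -ffunction-sections -fdata-sections -c sensor.c
avr-gcc -Wl,--gc-sections -Wl,--print-gc-sections -o firmware.elf sensor.o
avr-size firmware.elf
The --print-gc-sections output names every section (and therefore every function or variable) the linker discarded; avr-size then shows the effect on the final image.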
The listed option names might be useful keywords to search on stackoverflow for other suggestions on how to reduce the size of the binary. gc-sections e.g. yields 62 matches at the moment.
I created a script to remove useless code in many C++ libs (like ifdefs, comments, etc.).
Now, I want to compare the original lib and the "treated" lib to check if my script has done a good job.
The only solution I found is to compare the exported symbols.
I'm wondering if you have any other ideas to check the integrity?
FIRST of all: Unit tests are designed for this purpose.
You might get some mileage out of
compiling without optimization (-O0) and without debug information (or stripping it afterwards),
running objdump -dCS on both versions,
and comparing the disassemblies. Prepare to meet some / many spurious differences (the strip step was there to prevent needless differences in source line number info). In particular you will have to
ignore addresses
ignore generated label names
But if the transformation really does leave the code unmodified, you'd be able to verify it 1:1 using this technique and a little work.
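A rough sketch of that comparison, assuming the original and treated versions of a hypothetical file foo.cpp:
g++ -O0 -c foo_original.cpp -o original.o
g++ -O0 -c foo_treated.cpp -o treated.o
objdump -dC original.o > original.asm
objdump -dC treated.o > treated.asm
diff original.asm treated.asm
Expect the diff to be noisy with addresses and compiler-generated labels, as noted above.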
Assert-based unit tests would help you. Have some test cases, run them against the original library, and then run them against the library with the code removed.
I compiled a dll file with a whole bunch of cpp files. I want to see how much each cpp contributes to the final size of the dll, in order to cut down its size (say by excluding some libraries). Is there any way to do that? Thank you!
This ranges from quite difficult (which object do you charge library functions against) to impossible (when whole program optimization is used to inline across compilation unit boundaries).
I also suggest that it's not very useful. You need to know which functions to target for slimming down, not just which files.
Generating a map file during the build (pass /MAP to LINK.EXE) is probably the best you can do. The documentation also mentions something about symbol groups, which you might be able to use to your advantage as well.
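A minimal sketch with the MSVC toolchain, assuming hypothetical sources a.cpp and b.cpp:
cl /c a.cpp b.cpp
link /DLL /MAP:mylib.map /OUT:mylib.dll a.obj b.obj
The resulting mylib.map lists every symbol together with the object file it came from, so you can total up roughly how much each .obj contributes.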
In the last year I've started programming in Fortran working at a research university. Most of my prior experience is in web languages like PHP or old ASP, so I'm a newbie to compile statements.
I have two different codes I'm modifying.
One has an explicit statement creating .o files from modules (e.g. gfortran -c filea.f90) before creating the executable.
The other creates the executable file directly (sometimes creating .mod files, but no .o files, e.g. gfortran -o executable filea.f90 fileb.f90 mainfile.f90).
Is there a reason (other than, maybe, Makefiles) that one method is preferred over the other?
Compiling to object files first is called separate compilation. There are many advantages and a few drawbacks.
Advantages:
easy to transform object files (.o) to libraries and link to them later
many people can work on different source files at the same time
faster compiling (you don't compile the same files again and again when the source hasn't changed)
object files can be made from different language sources and linked together at some later time. To do that, the object files just have to use the same format and compatible calling conventions.
separate compilation enables distribution of system-wide libraries (either OS libraries, language standard libraries or third party libraries), either static or shared.
Drawbacks:
There are some optimizations (like optimizing unused functions away) that the compiler cannot perform, and the linker does not care about; however, many compilers now include the option to perform "link time optimization", which largely negates this drawback. But this is still an issue for system libraries and third party libraries, especially for shared libraries (it is impossible to optimize away parts of a component that may change at each run; other techniques like JIT compilation may mitigate this).
in some languages, the programmer has to provide some kind of header for the use of others that will link with this object. For example in C you have to provide .h files to go with your object files. But it is good practice anyway.
in languages with text based includes like C or C++, if you change a function prototype, you have to change it in two places: once in the header file, once in the implementation file.
When you have a project with a few hundred source files, you don't want to recompile all of them every time one changes. By compiling each source file into a separate object file and only recompiling those source files that are affected by a change, you spend the minimum amount of time from source code change to new executable.
make is the common tool used to track such dependencies and recreate your binary when something changes. Typically you set up what each source file depends on (these dependencies can typically be generated by your compiler - in a format suitable for make), and let make handle the details of creating an up to date binary.
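A minimal sketch of such a Makefile, assuming gfortran and the hypothetical file names from the question (recipe lines must start with a tab):
OBJS = filea.o fileb.o mainfile.o

executable: $(OBJS)
	gfortran -o executable $(OBJS)

%.o: %.f90
	gfortran -c $<

Real Fortran projects also need the module (.mod) dependencies spelled out, so that a file using a module is rebuilt when the module changes; tools such as makedepf90 can generate those dependency lines for you.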
The .o file is the Object File. It's an intermediate representation of the final program.
Typically, the .o file has compiled code, but what it does not have is final addresses for all of the different routines or data.
One of the things that a program needs before it can be run is something similar to a memory image.
For example.
Say you have your main program and it calls a routine. (This is faux Fortran; I haven't touched it in decades, so work with me here.)
PROGRAM MAIN
INTEGER X, Y, SQUARE
X = 10
Y = SQUARE(X)
WRITE(*,*) Y
END
Then you have the SQUARE function.
INTEGER FUNCTION SQUARE(N)
INTEGER N
SQUARE = N * N
END
These are individually compiled units. You can see that when MAIN is compiled it does not KNOW where SQUARE is, what address it is at. It needs to know that so that when it executes the microprocessor's JUMP SUBROUTINE (JSR) instruction, the instruction has someplace to go.
The .o file has the JSR instruction already, but it doesn't have the actual value. That comes later in the linking or loading phase (depending on your application).
So, MAIN's .o file has all of the code for MAIN, and a list of references that it wants resolved (notably SQUARE). SQUARE is basically stand-alone; it doesn't have any references, but at the same time, it has no address yet saying where it will exist in memory.
The linker will take all of the .o files and combine them into a single exe. In the old days, compiled code would literally be a memory image. The program would start at some address and simply be loaded into RAM wholesale, and then executed. So, in this scenario, you can see the linker taking the two .o files and concatenating them together (which gives SQUARE its actual address), then going back to find the SQUARE reference in MAIN and filling in the address.
Modern linkers don't go quite that far, and defer much of that final processing to when the program is actually loaded. But the concept is similar.
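As a rough sketch, assuming the two units above are saved as main.f90 and square.f90 (the exact symbol names depend on the compiler):
gfortran -c main.f90 square.f90
nm main.o
gfortran -o prog main.o square.o
nm main.o should list the square symbol with a U (undefined) mark; only the final link step fills in a real address for it.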
By compiling to .o files, you end up with reusable units of logic that are then combined later by the linking and loading processes before execution.
The other nice aspect is that the .o files can come from different languages. As long as the calling mechanisms are compatible (i.e. how arguments are passed to and from functions and procedures), then once compiled into a .o, the source language becomes less relevant. You can link and combine C code with FORTRAN code, say.
In PHP et al., the process is different because all of the code is loaded into a single image at runtime. You can consider FORTRAN's .o files similar to how you would use PHP's include mechanism to combine files into a large, cohesive whole.
Another reason, apart from compile time, is that the compilation process is a multi-step process.
The object files are just one intermediate output from that process. They will eventually be used by the linker to produce the executable file.
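If you want to see those steps, most compilers can keep the intermediate files around; a small sketch with gfortran, assuming a file main.f90:
gfortran -save-temps -c main.f90
This leaves the generated assembly (.s) next to the .o, making the separate stages of the process visible.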
We compile to object files to be able to link them together to form larger executables. That is not the only way to do it.
There are also compilers that don't do it that way, but instead compile to memory and execute the result immediately. Earlier, when students had to use mainframes, this was standard. Turbo Pascal also did it this way.