Creating a modular language in LLVM? - c++

I'm developing a new language in LLVM using the C++ API which compiles down to target the C ABI.
I would like to support modular compilation by allowing end users to build what are effectively static libraries. I noticed the LLVM C++ API has a llvm::Linker class that I can use during compilation to combine source files (llvm::Module), however I wanted to guarantee library compatibility via metadata version numbers or at least the publicly exposed interface between separate compilation runs.
Much of the information available on metadata in LLVM suggest that it should only be used for extended information that would not break correctness when silently removed.
llvm
blog
IntrinsicsMetadataAttributes
pdf
I wouldn't think this would be a deal breaker as it could be global metadata, but it would be good to get a second opinion on that point.
I also know there is a method in IRReader to parseIRFile so I can load some previously built bc files. I would be curious if it would be reasonable practice to include size and CRC information for comparison when loading these files.
My language has concepts similar to C# including interfaces. I figure I could allow modular compilation by importing/exporting an interface type along with external functions (Much like C++, I don't restrict the language to only methods of classes).
This approach allows me to include language specific information in the interface without needing to encode it in the IR as both the library and the calling code would be required to build with the interface. This again requires the interfaces to be compatible.
One language feature that would require extended information would be named parameters in functions.
My language is very type-safe and also mandates named parameters so there is no predetermined function parameter order. This allows call sites to be more explicit, the compiler to catch erroneous parameter usage, and authors have more liberty in determining default parameters as they are not restricted to the last parameters to the function.
The compiler will need to know names, modifiers, defaults, etc. of these parameters to correctly map calls at compile time, so I figure the interface approach would work well here.
TL;DR
Does LLVM have any predefined facilities for building static libraries?
Is version number, size, and CRC information reasonable use cases for LLVM's metadata?

This is probably not QUITE an answer... Or at least not a complete answer.
I like this question, as I'm going to need a solution in the future too (some time in the next few months or years) for my Pascal compiler. It supports "units" which is meant to be a separately compiled object, but currently what I do is simply drag in the source file and compile it into the main llvm::Module - that's neither efficient nor flexible (can't use the linker to choose between the "Linux" and "Windows" version of some code, for example - not that I think there is 5% chance that my compiler will work on Windows without modification anyway...)
However, I'm not sure storing the "object" file as LLVM IR would be the right thing to do. I was thinking that a better way would be to store your AST in some serialized form - then
you don't depend on LLVM versions changing the IR format.
You can add whatever metadata you like. There won't be much
difference in generating LLVM-IR from this during your link phase or
building the IR at compile and then reading the IR to figure out if
the metadata is correct. [The slow part, as you may have already found out, is the optimisation and MC generation, and you'd still have to do that either way]
Like I started out, I'm not sure this is an answer, but it's my thoughts so far on the subject. Now I'll go back to adding debug symbol stuff to my Pascal compiler... Before Christmas, I couldn't see the source in GDB. Now I can step, but no viewing of variables yet...

Related

C++ Introspection: Enumerate available classes and methods in a C++ codebase

I'm working on some custom C++ static code analysis for my PHD thesis. As part of an extension to the C++ type system, I want to take a C++ code base and enumerate its available functions, methods, and classes, along with their type signatures, with minimal effort (it's just a prototype). What's the best approach to doing something like this quickly and easily? Should I be hacking on Clang to spit out the information I need? Should I look at parsing header files with something like SWIG? Or is there an even easier thing I could be doing?
GCCXML, based on GCC, might be the ticket.
As I understand it, it collects and dumps all definitions but not the content of functions/methods.
Others will likely mention CLANG, which certainly parses code and must have access to the definitions of the symbols in a compilation unit. (I have no experience here).
For completeness, you should know about our DMS Software Reengineering Toolkit
with its C++ Front End. (The CLANG answers seem to say "walk the AST"). The DMS solution provides an enumerable symbol table containing all the type information. You can walk the AST, too, if you want.
Often a static analysis leads to a diagnosis, and a desire to change the source code.
DMS can apply source-to-source program transformations to carry out such changes conditioned
by the analysis.
I heartily recommend LLVM for statical analysis (see also Clang Static Analyzer)
I think your best bet is hacking on clang and getting the AST. There is a good tutorial for that here. Its very easy to modify its syntax and it also has a static analyzer.
At my work, I use the API from a software package called "Understand 4 C++" by scitools. I use this to write all my static analysis tools. I even wrote a .NET API to wrap their C API. Which I put on codeplex.
Once you have that, dumping all class types is easy:
ClassType[] allclasses = Database.GetAllClassTypes()
foreach (ClassType c in allclasses)
{
Console.WriteLine("Class Name: {0}", c.NameLong);
}
Now for a little backstory about a task I had that is similar to yours.
In some years we have to keep our SDK binary backwards compatible with the previous years SDK. In that case it's useful to compare the SDK code between releases to check for potential breaking changes. However with a couple of hundred files, and tens of thousands of lines of comments that can be a big headache using a text diff tool like Beyond Compare or Araxis. So what I really need to look at is actual code changes, not re-ordering, not moving code up and down in the file, not adding comments etc...
So, a tool I wrote to dump out all the code.
In one text file I dump all all the classes. For each class I print its inheritance tree, its member functions both virtual and non-virtual. For each virtual function I print what parent class virtual methods it overrides (if any). I also print out its member variables.
Same goes with the structs.
In another file I print all the macro's.
In another file I print all the typedefs.
Then using this I can diff these files with files from a previous release. It then becomes apparent instantly what has changed from release to release. For instance it's easy to see where a function parameter was changed from TCHAR* to const TCHAR* for instance.
You might consider developping a GCC Plugin for your purpose.
And GCC MELT is a high level domain specific language (that I designed & implemented) for easily extend GCC.
The paper at GROW09 workshop by Peter Collingbourne and Paul Kelly on a A Compile-Time Infrastructure for GCC Using Haskell might be relevant for your work.

Open-source C++ scanning library

Rationale: In my day-to-day C++ code development, I frequently need to
answer basic questions such as who calls what in a very large C++ code
base that is frequently changing. But, I also need to have some
automated way to exactly identify what the code is doing around a
particular area of code. "grep" tools such as Cscope are useful (and
I use them heavily already), but are not C++-language-aware: They
don't give any way to identify the types and kinds of lexical
environment of a given use of a type or function a such way that is
conducive to automation (even if said automation is limited to
"read-only" operations such as code browsing and navigation, but I'm
asking for much more than that below).
Question: Does there exist already an open-source C/C++-based library
(native, not managed, not Microsoft- or Linux-specific) that can
statically scan or analyze a large tree of C++ code, and can produce
result sets that answer detailed questions such as:
What functions are called by some supplied function?
What functions make use of this supplied type?
Ditto the above questions if C++ classes or class templates are involved.
The result set should provide some sort of "handle". I should be able
to feed that handle back to the library to perform the following types
of introspection:
What is the byte offset into the file where the reference was made?
What is the reference into the abstract syntax tree (AST) of that
reference, so that I can inspect surrounding code constructs? And
each AST entity would also have file path, byte-offset, and
type-info data associated with it, so that I could recursively walk
up the graph of callers or referrers to do useful operations.
The answer should meet the following requirements:
API: The API exposed must be one of the following:
C or C++ and probably is "C handle" or C++-class-instance-based
(and if it is, must be generic C o C++ code and not Microsoft- or
Linux-specific code constructs unless it is to meet specifics of
the given platform), or
Command-line standard input and standard output based.
C++ aware: Is not limited to C code, but understands C++ language
constructs in minute detail including awareness of inter-class
inheritance relationships and C++ templates.
Fast: Should scan large code bases significantly faster than
compiling the entire code base from scratch. This probably needs to
be relaxed, but only if Incremental result retrieval and Resilient
to small code changes requirements are fully met below.
Provide Result counts: I should be able to ask "How many results
would you provide to some request (and no don't send me all of the
results)?" that responds on the order of less than 3 seconds versus
having to retrieve all results for any given question. If it takes
too long to get that answer, then wastes development time. This is
coupled with the next requirement.
Incremental result retrieval: I should be able to then ask "Give me
just the next N results of this request", and then a handle to the
result set so that I can ask the question repeatedly, thus
incrementally pulling out the results in stages. This means I
should not have to wait for the entire result set before seeing
some subset of all of the results. And that I can cancel the
operation safely if I have seen enough results. Reason: I need to
answer the question: "What is the build or development impact of
changing some particular function signature?"
Resilient to small code changes: If I change a header or source
file, I should not have to wait for the entire code base to be
rescanned, but only that header or source file
rescanned. Rescanning should be quick. E.g., don't do what cscope
requires you to do, which is to rescan the entire code base for
small changes. It is understood that if you change a header, then
scanning can take longer since other files that include that header
would have to be rescanned.
IDE Agnostic: Is text editor agnostic (don't make me use a specific
text editor; I've made my choice already, thank you!)
Platform Agnostic: Is platform-agnostic (don't make me only use it
on Linux or only on Windows, as I have to use both of those
platforms in my daily grind, but I need the tool to be useful on
both as I have code sandboxes on both platforms).
Non-binary: Should not cost me anything other than time to download
and compile the library and all of its dependencies.
Not trial-ware.
Actively Supported: It is likely that sending help requests to mailing lists
or associated forums is likely to get a response in less than 2
days.
Network agnostic: Databases the library builds should be able to be used directly on
a network from 32-bit and 64-bit systems, both Linux and Windows
interchangeably, at the same time, and do not embed hardcoded paths
to filesystems that would otherwise "root" the database to a
particular network.
Build environment agnostic: Does not require intimate knowledge of my build environment, with
the notable exception of possibly requiring knowledge of compiler
supplied CPP macro definitions (e.g. -Dmacro=value).
I would say that CLang Index is a close fit. However I don't think that it stores data in a database.
Anyway the CLang framework offer what you actually need to build a tool tailored to your needs, if only because of its C, C++ and Objective-C parsing / indexing capabitilies. And since it's provided as a set of reusable libraries... it was crafted for being developed on!
I have to admit that I haven't used either because I work with a lot of Microsoft-specific code that uses Microsoft compiler extensions that i don't expect them to understand, but the two open source analyzers I'm aware of are Mozilla Pork and the Clang Analyzer.
If you are looking for results of code analysis (metrics, graphs, ...) why not use a tool (instead of API) to do that? If you can, I suggest you to take a look at Understand.
It's not free (there's a trial version) but I found it very useful.
Maybe Doxygen with GraphViz could be the answer of some of your constraints but not all,for example the analysis of Doxygen is not incremental.

Easy way to get function prototypes?

A friend and I were discussing imaginary and real languages and a question that came up was if one of us wanted to generate headers for another language (perhaps D which already has a tool) what would be an easy and very good way to do this?
One of us said to scan C files and headers and ignore function bodies and only count the braces within to figure out when a function is finished. The counter to that was typedefs, defines (which braces but defines were considered as a trivial problem) and templates + specialization.
Another solution was to read binaries produce, not the actual exe but the object files the linker uses. The counter to that was the format and complexity. None of us knew anything of any object format so we couldnt estimate (we were thinking of gcc and VS c++).
What do you guys think? Which is easier? This should be backed up with reasonable logic and fact.
If someone can link to a helpful project, one that parses C files/headers and outputs it or one that reads in elf data and displays info in an example project would be useful. I tried googling but I didnt know what it would be called. I found libelf but at this moment I couldn't get it to compile. I might be able to soon.
You can use clang libraries to parse C/C++ source code and extract any information you want in particular function prototypes.
Due to library-based architecture it is easy to reuse parts of clang that you need. In your case these are frontend libraries (liblex, libparse, libsema). I think this is a more feasible approach then using hand-written scanner considering the difficulties that you mentioned (typedefs, defines, etc).
clang can also be used as a tool to parse the source code and output AST in XML form, for example if you have the file test.cpp:
void foo() {}
int main()
{
foo();
}
and invoke clang++ -Xclang -ast-print-xml -fsyntax-only test.cpp you'll get the file test.xml similar to the following (here irrelevant parts skipped for brevity):
<?xml version="1.0"?>
<CLANG_XML>
<TranslationUnit>
<Function id="_1D" file="f2" line="1" col="6" context="_2"
name="foo" type="_12" function_type="_1E" num_args="0">
</Function>
<Function id="_1F" file="f2" line="3" col="5" context="_2"
name="main" type="_21" function_type="_22" num_args="0">
</Function>
</TranslationUnit>
<ReferenceSection>
<Types>
<FunctionType result_type="_12" id="_1E"/>
<FundamentalType kind="int" id="_21"/>
<FundamentalType kind="void" id="_12"/>
<FunctionType result_type="_21" id="_22"/>
<PointerType type="_12" id="_10"/>
</Types>
<Files>
<File id="f2" name="test.cpp"/>
</Files>
</ReferenceSection>
</CLANG_XML>
I don't think that extracting this information from binaries is possible at least for symbols with C linkage, because they don't have name mangling.
ctags' output is quite easy to read/parse
if you want to simply generate a binding, try swig
What you're talking about is compiling: The act of transforming code in one formal language to another. There's a good solid science behind this that, if followed carefully, will guarantee your program halts with a correct analogous code.
Granted, you don't want to parse the whole of the C++ language (hooray for that!), so you just need to define the relevant grammar and define everything else as acceptable noise or comments.
Don't use regular expressions. These won't do because C++ is not a regular language.
One way to do this is to define your interfaces in an abstract language (an IDL), and generate headers for all languages that you're interested in. You can limit the scope of your IDL to those features that are possible in each target language.
Windows takes this approach in its MIDL language, for example.
Doxygen can help with this. It's an advanced topic, and somewhat documented.
In a perfect world...
Each compiler of any programming language would emit such information as part of its output. It could be an ELF extension or a new generally accepted file format, which contains an ELF/COFF/whatever section.
This would spare the (across the globe) thousands of man hours to generate "language bindings" for dozens of languages. Dynamic languages, such as Common Lisp would not need FFI libraries - as it all would happen under the hood and loading a shared library in that new format could automatically be inspected and functions in it could be made available on the fly, without any further ado.
And all, which actually would have to be stored as extra information for each (exported) function is:
Generic name
Return type
Argument list
calling convention (there are not that many in practice)
A reference to the symbol table entry, referred to.
Why has it not been done during the past 30 years?
Because language designers still see the compiler output as their "private affair", whereas IMHO, this should be part of the OS/Runtime/Loader.
Many workarounds exist, some listed here in other answers (IDL, binding generators, such as SWIG and a myriad of others, mostly ad hoc and of varying and typically insufficient quality).
The ELF guys are Unix guys and as such C guys, who still live in a bubble, thinking C is some sort of "golden standard". But even they would not have any problems, using the information to generate their beloved header files.
RUST might not have invented crates in that perfect world and you would not even care if some shared library had been written in C or C++ or Rust or Zig or Haskell - anyone could just use it.
This domain is still dominated by "computer scientists", not engineers; while in theory, there could be an infinite number of calling conventions, in fact, those in active use are very countable few (LLVM supports many of those actually relevant).
Another reason, it has not been done yet, is that there is a danger, some committee would create a monster, trying make it "open, flexible, ..." and no one would eventually dare using it. DCE (distributed computing environment) gives that idea.

How hide C++ source code from customer

I wish to send some components to my customers. The reasons I want to deliver source code are:
1) My class is templatized. Customer might use any template argument, so I can't pre-compile and send .o file.
2) The customer might use different compiler versions for gcc than mine. So I want him to do compilation at his end.
Now, I can't reveal my source code for obvious reasons. The max I can do is to reveal the .h file. Any ideas how I may achieve this. I am thinking about some hooks in gcc that supports decryption before compilation, etc. Is this possible?
In short, I want him to be able to compile this code without being able to peek inside.
Contract = good, obfuscation = ungood.
That said, you can always do a kind of PIMPL idiom to serve your customer with binaries and just templated wrappers in the header(s). The idea is then to use an "untyped" separately compiled implementation, where the templated wrapper just provides type safety for client code. That's how one often did things before compilers started to understand how to optimize templates, that is, to avoid machine-code level code bloat, but it only provides some measure of protection about trivial copy-and-paste theft, not any protection against someone willing to delve into the machine code.
But perhaps the effort is then greater than just reinventing your functionality?
Just adding some terminology to Alf's answer: The Thin template idiom is what you might look at. It basically simulates the functionality of a generic. Don't get confused by the wikipedia article which pops up in google, you don't have to use void*...
This, of course, does not guarantee binary compatibility. As usual with 'native' c++, you either compile the component for customers platform yourself and deploy the binary, or give them your code... The difference to the pure generic component code is that you can do the former at all.
use some c++ obfuscators may be help?: http://www.semdesigns.com/products/obfuscators/CppObfuscationExample.html or Magle It
First, if you're going to provide the source code, then you have to provide the source code. Sure, you could encrypt it, but even if GCC had a "decrypt before compile" option, it would need to decrypt the code, and if GCC can decrypt the code, so can your customer.
What you're asking is impossible. (If you find a way to do it, I believe the movie industry might have a multi-million contract for you. They currently have to resort to expensive custom hardware to prevent people from ripping content, and that only works to a limited degree)
As for your "obvious reasons" why you don't want to provide the source code, I don't see why they're obvious. What would happen if you provided the source code?
You have two options:
provide the source code in its entirety, or
compile everything that can be precompiled into a (static or dynamic) library, and provide your customer with that, plus the header files.
what about pimpls?
1) My class is templatized. Customer might use any template argument, so I can't pre-compile and send .o file.
2) The customer might use different compiler versions for gcc than mine. So I want him to do compilation at his end.
Now, I can't reveal my source code for obvious reasons. The max I can do is to reveal the .h file. Any ideas how I may achieve this. I am thinking about some hooks in gcc that supports decryption before compilation, etc. Is this possible?
In short, I want him to be able to compile this code without being able to peek inside.
Consideration 2) above encompasses A) ABI differences such that the same code compiled with different compiler versions/vendors on the same platform is incompatible, as well as B) the differences in system libraries, kernel versions etc. that the code might be dependent on. The only general solution is to compile on the specific platforms. Either you do it for all platforms, or you give them all the source code and they do it. That's not just the headers and template implementation, that's your out-of-line functions too. You might mitigate A) a little by building a wall of more interoperable extern "C" functions, but you're basically stuck when it comes to B).
So, can you decrypt during compilation? Only if you ship your own hacked GCC binaries to them, built for their specific system, which is probably more hassle than providing different builds of your own libraries (though it may address the template/header exposure issue).
Alternatively, you could employ source code obfuscation techniques. This is probably - practically - as good as it gets. I don't know what tools are out there, but it's an approach that people have pursued for decades (though I'm yet to hear anyone recommend it), so there's sure to be some mature tools.
Re templated code - other people have suggested a templated front end to a C-style generic implementation shipped as a precompiled object. That may or may not be practical (clearly risks performance degradation, and you have to capture the set of type-specific operations you want - e.g. by instantiating a type-specific class derived from an abstract operations base class) but anyway the precompiled object still runs afoul of B).
One other thought... clients might take your source code, but are unlikely to understand it as well as you. Even if they build more systems dependent on their version of it, in a way they're getting more locked in, and may have more need for your services in future. And, if you see they've not played fair, you charge them for it appropriately when the time comes.
It seems with gcc 4.5 comes the support for plugins. So you can provide your own .so which would be, for instance, called before compilation stage starts. So you can have all kinds of tricks(decryption of source file) in there, neatly hidden. This would also be portable solution as no change is made to g++ per se.
This is exactly what I was looking for. You can read more here:
http://www.codesynthesis.com/~boris/blog/2010/05/03/parsing-cxx-with-gcc-plugin-part-1/

Where do I learn "what I need to know" about C++ compilers?

I'm just starting to explore C++, so forgive the newbiness of this question. I also beg your indulgence on how open ended this question is. I think it could be broken down, but I think that this information belongs in the same place.
(FYI -- I am working predominantly with the QT SDK and mingw32-make right now and I seem to have configured them correctly for my machine.)
I knew that there was a lot in the language which is compiler-driven -- I've heard about pre-compiler directives, but it seems like someone would be able to write books the different C++ compilers and their respective parameters. In addition, there are commands which apparently precede make (like qmake, for example (is this something only in QT)).
I would like to know if there is any place which gives me an overview of what compilers are out there, and what their different options are. I'd also like to know how each of them views Makefiles (it seems that there is a difference in syntax between them?).
If there is no website regarding, "Everything you need to know about C++ compilers but were afraid to ask," what would be the best way to go about learning the answers to these questions?
Concerning the "numerous options of the various compilers"
A piece of good news: you needn't worry about the detail of most of these options. You will, in due time, delve into this, only for the very compiler you use, and maybe only for the options that pertain to a particular set of features. But as a novice, generally trust the default options or the ones supplied with the make files.
The broad categories of these features (and I may be missing a few) are:
pre-processor defines (now, you may need a few of these)
code generation (target CPU, FPU usage...)
optimization (hints for the compiler to favor speed over size and such)
inclusion of debug info (which is extra data left in the object/binary and which enables the debugger to know where each line of code starts, what the variables names are etc.)
directives for the linker
output type (exe, library, memory maps...)
C/C++ language compliance and warnings (compatibility with previous version of the compiler, compliance to current and past C Standards, warning about common possible bug-indicative patterns...)
compile-time verbosity and help
Concerning an inventory of compilers with their options and features
I know of no such list but I'm sure it probably exists on the web. However, suggest that, as a novice you worry little about these "details", and use whatever free compiler you can find (gcc certainly a great choice), and build experience with the language and the build process. C professionals may likely argue, with good reason and at length on the merits of various compilers and associated runtine etc., but for generic purposes -and then some- the free stuff is all that is needed.
Concerning the build process
The most trivial applications, such these made of a single unit of compilation (read a single C/C++ source file), can be built with a simple batch file where the various compiler and linker options are hardcoded, and where the name of file is specified on the command line.
For all other cases, it is very important to codify the build process so that it can be done
a) automatically and
b) reliably, i.e. with repeatability.
The "recipe" associated with this build process is often encapsulated in a make file or as the complexity grows, possibly several make files, possibly "bundled together in a script/bat file.
This (make file syntax) you need to get familiar with, even if you use alternatives to make/nmake, such as Apache Ant; the reason is that many (most?) source code packages include a make file.
In a nutshell, make files are text files and they allow defining targets, and the associated command to build a target. Each target is associated with its dependencies, which allows the make logic to decide what targets are out of date and should be rebuilt, and, before rebuilding them, what possibly dependencies should also be rebuilt. That way, when you modify say an include file (and if the make file is properly configured) any c file that used this header will be recompiled and any binary which links with the corresponding obj file will be rebuilt as well. make also include options to force all targets to be rebuilt, and this is sometimes handy to be sure that you truly have a current built (for example in the case some dependencies of a given object are not declared in the make).
On the Pre-processor:
The pre-processor is the first step toward compiling, although it is technically not part of the compilation. The purposes of this step are:
to remove any comment, and extraneous whitespace
to substitute any macro reference with the relevant C/C++ syntax. Some macros for example are used to define constant values such as say some email address used in the program; during per-processing any reference to this constant value (btw by convention such constants are named with ALL_CAPS_AND_UNDERSCORES) is replace by the actual C string literal containing the email address.
to exclude all conditional compiling branches that are not relevant (the #IFDEF and the like)
What's important to know about the pre-processor is that the pre-processor directive are NOT part of the C-Language proper, and they serve several important functions such as the conditional compiling mentionned earlier (used for example to have multiple versions of the program, say for different Operating Systems, or indeed for different compilers)
Taking it from there...
After this manifesto of mine... I encourage to read but little more, and to dive into programming and building binaries. It is a very good idea to try and get a broad picture of the framework etc. but this can be overdone, a bit akin to the exchange student who stays in his/her room reading the Webster dictionary to be "prepared" for meeting native speakers, rather than just "doing it!".
Ideally you shouldn't need to care what C++ compiler you are using. The compatability to the standard has got much better in recent years (even from microsoft)
Compiler flags obviously differ but the same features are generally available, it's just a differently named option to eg. set warning level on GCC and ms-cl
The build system is indepenant of the compiler, you can use any make with any compiler.
That is a lot of questions in one.
C++ compilers are a lot like hammers: They come in all sizes and shapes, with different abilities and features, intended for different types of users, and at different price points; ultimately they all are for doing the same basic task as the others.
Some are intended for highly specialized applications, like high-performance graphics, and have numerous extensions and libraries to assist the engineer with those types of problems. Others are meant for general purpose use, and aren't necessarily always the greatest for extreme work.
The technique for using each type of hammer varies from model to model—and version to version—but they all have a lot in common. The macro preprocessor is a standard part of C and C++ compilers.
A brief comparison of many C++ compilers is here. Also check out the list of C compilers, since many programs don't use any C++ features and can be compiled by ordinary C.
C++ compilers don't "view" makefiles. The rules of a makefile may invoke a C++ compiler, but also may "compile" assembly language modules (assembling), process other languages, build libraries, link modules, and/or post-process object modules. Makefiles often contain rules for cleaning up intermediate files, establishing debug environments, obtaining source code, etc., etc. Compilation is one link in a long chain of steps to develop software.
Also, many development environments abstract the makefile into a "project file" which is used by an integrated development environment (IDE) in an attempt to simplify or automate many programming tasks. See a comparison here.
As for learning: choose a specific problem to solve and dive in. The target platform (Linux/Windows/etc.) and problem space will narrow the choices pretty well. Which you choose is often linked to other considerations, such as working for a particular company, or being part of a team. C++ has something like 95% commonality among all its flavors. Learn any one of them well, and learning the next is a piece of cake.