Does a C++ compiler generate machine code via assembly language (i.e., does the C++ compiler first convert C++ code to assembly language and then use an assembler to convert it to machine code), or is assembly language output just an option for reference or debugging purposes?
It doesn't have to, but most do it anyway, as the same assembler (program) can then be used for the output of any C/C++/whatever-to-assembly compiler.
g++ for example generates assembler code first (you can see the generated assembler using the -S switch).
MSVC does it too (/FAs).
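For example, compiling a trivial function with either toolchain lets you inspect the emitted assembly (a minimal sketch; -O2 is just a common optimization level, not required):

    // square.cpp -- inspect the assembly a compiler generates for this file:
    //   g++ -S -O2 square.cpp    (writes square.s)
    //   cl /FAs /c square.cpp    (MSVC: writes square.asm next to square.obj)
    int square(int x) {
        return x * x;
    }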
They used to, a long time ago, although that was typical only for C compilers. The first compiler I used worked that way. It was not unusual for compilers that generated code for unusual hardware and operating systems: by emitting assembly they saved the cost of writing an object file generator and leveraged the existing assembler and linker.
Modern compilers don't bother with the extra step of generating machine code as text and running an assembler afterward; they generate the machine code directly in binary form. The compile-speed advantage is fairly significant, since producing and re-parsing text isn't cheap. The option to generate a textual listing from the binary form is pretty simple to implement.
I have a C++ program that I want to compile to assembly, and then an assembler will compile it to machine code.
Now, as far as I know, in order to transform assembly code to machine code the assembler needs some kind of table to map assembly instructions to the actual machine instructions.
Which table will the assembler use? Is there a chance that my C++ program won't run on all CPUs, because CPUs use different tables which means that the same machine code will do different things on different CPUs?
The assembler assembles for whatever architecture it has been told to (or programmed to) assemble for. As the assembly language for each instruction set architecture (ISA) differs, you can only assemble an assembly program written for one architecture for that same architecture. It is generally not possible to accidentally or intentionally assemble the program for the wrong architecture.
When you use a compiler, the compiler invokes the correct assembler with the correct flags to assemble the assembly code it generated for the architecture you told it to compile for. The resulting program will only run on processors of the ISA you have compiled it for. If you want the program to run on processors of a different ISA, you have to compile it for that other ISA.
If your program is poorly written, it is possible that it won't compile or work when compiled for other architectures than the one(s) you developed it for. Such a program is called an unportable program. However, unless you do weird things or make assumptions about properties of the architecture you are programming for, this is unlikely to happen.
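A classic example of such an architecture assumption is code that depends on byte order (a sketch, not taken from the original answer):

    // endian.cpp -- unportable: assumes the first byte of an int is its
    // least-significant byte, which holds only on little-endian CPUs (x86,
    // most ARM configurations) and fails on big-endian targets.
    #include <cstdio>

    int main() {
        int value = 1;
        unsigned char first = *reinterpret_cast<unsigned char*>(&value);
        std::printf("%d\n", first);  // 1 on little-endian, 0 on big-endian
    }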
In general, what is called assembly is roughly a human-readable (text) form of machine code (binary).
As franji1 said in a comment, in general compilers emit an intermediate abstract machine code from the source. This kind of intermediate code is designed to be easily translated into assembly/machine code.
I have a C++ program that I want to compile to assembly, and then an assembler will compile it to machine code.
This is what a compiler is designed to do. The word compiler is somewhat misleading: it can refer to the "compiler phase" alone or to the whole "compiler toolchain". The compiler phase is the one that translates your source code into the intermediate abstract form, which then needs to be translated into the target assembly/machine code by the assembler. Compilation commonly denotes the whole process from source code to executable machine code.
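With GCC you can run those stages one at a time and inspect each intermediate form yourself (all four flags are standard GCC driver options):

    // prog.cpp -- drive each stage of the pipeline separately:
    //   g++ -E prog.cpp -o prog.ii   (preprocess only)
    //   g++ -S prog.ii  -o prog.s    (compile to assembly)
    //   g++ -c prog.s   -o prog.o    (assemble to an object file)
    //   g++ prog.o      -o prog      (link into an executable)
    int main() { return 0; }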
Now, as far as I know, in order to transform assembly code to machine code the assembler needs some kind of table to map assembly instructions to the actual machine instructions.
Roughly, yes. This is what a document like an Instruction Set Reference Manual is for: describing how the textual form must be translated into byte form.
Which table will the assembler use?
See the Instruction Set Reference Manual for the architecture you are targeting.
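As a toy illustration of what such a table amounts to (the two x86 encodings below are well known; everything else here is just sketch code, not any real assembler's internals):

    // opcode_table.cpp -- the conceptual core of an assembler: a mapping
    // from mnemonics to the bytes emitted for them.
    #include <cstdint>
    #include <map>
    #include <string>

    const std::map<std::string, std::uint8_t> kOpcodes = {
        {"nop", 0x90},  // x86: no operation
        {"ret", 0xC3},  // x86: near return
    };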
Is there a chance that my C++ program won't run on all CPUs, because CPUs use different tables which means that the same machine code will do different things on different CPUs?
You have to generate a byte form of your program for each platform (machine/OS). A compiler is designed to generate, for a given platform, machine code that realizes exactly what your source code specifies. This is why compilers exist: to free you from writing programs in assembly (which is very hard to do).
I have started learning C++, and I have learned that a compiler turns source code from a program into machine code through compilation.
However, I've learned that C++ compilers actually translate the source code into Assembly as an interim step before translating the Assembly code into machine code. What is the purpose of this step?
Why don't they translate it directly into machine code?
First of all: There is no need to write an intermediate assembly language representation. Every compiler vendor is free to emit machine code directly.
But there are a lot of good reasons to "write" an intermediate assembly representation and pass it to an assembler to generate the final executable file. Importantly, there is no need to actually write a file to some kind of media; the output can be piped directly to the assembler.
Some of the reasons why vendors use an intermediate assembly language:
The assembler is already available and "knows" how to generate various executable file formats (ELF, for example).
Some tasks can be postponed until the assembly level is reached, resolving jump targets for example. This is possible because the intermediate assembly is often not a strict 1:1 representation but a kind of "macro assembler" that can do a lot more than simply create bits from mnemonics.
The assembly stage is followed by running the linker. This step would also have to be performed if a compiler wanted to create executable files directly, which would duplicate a lot of work. For example, all the relocation of previously unknown addresses must be done on the way to an executable file. By simply using the existing assembler and linker, the job is already done.
The intermediate assembly is always useful for debugging purposes. So there is a more or less hard requirement to support this intermediate step anyway, even if it is skipped when the user requests no debug output.
I believe there are a lot more...
The bad side is:
"writing" a text representation and parsing the program from the text takes longer as directly passing the information to the linker.
Usually, compilers invoke the assembler (and the linker, or the archiver) on your behalf unless you ask them to do otherwise, because it is convenient.
But separating the distinct steps is useful because it allows you to swap the assembler (and linker and archiver) for another if you so desire or need to. And conversely, this assembler may potentially be used with other compilers.
The separation is also useful because assemblers existed before compilers did. By using a pre-existing assembler, there is no need to re-implement the machine-code translation. This is still potentially relevant because occasionally there is a need to bootstrap a new CPU architecture.
With reference to: http://www.cplusplus.com/articles/2v07M4Gy/
During the compilation phase,
This phase translates the program into low-level assembly code. The compiler takes the preprocessed file (without any directives) and generates an object file containing assembly-level code. The object file created is in binary form, and each entry in it describes one low-level machine instruction.
Now, if I am correct, different CPU architectures use different assembly languages/syntaxes.
My question is: how does the compiler know which assembly language syntax the source code has to be translated to? In other words, how does the C++ compiler know which CPU architecture is in the machine it is working on?
Is there any mapping used by the assembler with respect to the CPU architecture for generating assembly code for different CPU architectures?
N.B.: I am a beginner!
Each compiler needs to be "ported" to the given system. For each system supported, a "compiler port" needs to be programmed by someone who knows the system in-depth.
WARNING: This is extremely simplified.
In short, there are three main parts to a compiler:
"Front-end" : This part reads the language (in this case c++) and converts it to a sort of pseudo-code specific to the compiler. (An Abstract Syntactic Tree, or AST)
"Optimizer/Middle-end" : This part takes the AST and make a non-architecture-dependant optimized one.
"Back-end" : This part takes the AST, and converts it to binary executable code, specific to the architecture you want to compile your language on.
When you download a C++ compiler for your platform, you in fact download, for example, the C++ front-end with the linux-amd64 back-end.
This architecture is extremely helpful, because it allows porting the compiler to another architecture without rewriting the whole parsing/optimizing machinery. It also allows someone to create another optimizer, or even another front-end supporting a whole different language; as long as it outputs a correct AST, it will be compatible with every single back-end ever written for this compiler.
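As a toy sketch of that split (none of these names come from a real compiler; the AST and both "back-ends" are invented for illustration):

    // A minimal AST for integer addition, plus two interchangeable
    // "back-ends" that walk the same tree and emit different targets.
    #include <iostream>
    #include <memory>
    #include <string>

    struct Expr {
        int value = 0;                   // used when the node is a literal
        std::unique_ptr<Expr> lhs, rhs;  // both null for literals
    };

    // Back-end 1: emit a Lisp-like textual form.
    std::string emit_sexpr(const Expr& e) {
        if (!e.lhs) return std::to_string(e.value);
        return "(+ " + emit_sexpr(*e.lhs) + " " + emit_sexpr(*e.rhs) + ")";
    }

    // Back-end 2: emit pseudo stack-machine instructions.
    void emit_stack(const Expr& e) {
        if (!e.lhs) { std::cout << "push " << e.value << "\n"; return; }
        emit_stack(*e.lhs);
        emit_stack(*e.rhs);
        std::cout << "add\n";
    }

    int main() {
        Expr sum;                                // represents 1 + 2
        sum.lhs = std::make_unique<Expr>(); sum.lhs->value = 1;
        sum.rhs = std::make_unique<Expr>(); sum.rhs->value = 2;
        std::cout << emit_sexpr(sum) << "\n";    // prints: (+ 1 2)
        emit_stack(sum);                         // prints: push 1, push 2, add
    }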
Simply put, the knowledge of the target system is coded into the compiler.
So you might have a C compiler that generates SPARC binaries, and a C compiler that generates VAX binaries. They both accept the same input language (as defined in the C standard), but produce different programs from it.
Often we just refer to "the compiler", meaning the one that will generate binaries for our current environment.
In modern times, the distinction has become less obvious with compiler collections such as GCC. Now the "different compilers" are often the same compiler program, just set up with different configurations (these are the "target description files").
Just to complete the answers given here:
The target architecture is indeed coded into the specific compiler instance you're using. This is also important for a process called "cross-compiling": compiling, on one system, an executable that will run on another system/architecture.
Consider working on an embedded system-on-chip that uses a completely different instruction set than your own: you're working on an x86-64 Linux system, but need to compile a mobile app running on an ARM microprocessor, or some other instruction set architecture.
It would be unreasonable to compile your code on the target system, which might be so limited in CPU and memory that it can't feasibly run a compiler - and so you can use a GCC (or any other compiler) port for that target architecture on your favorite system.
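For example (the exact cross-toolchain name varies by distribution; arm-linux-gnueabihf-g++ is a common one and is an assumption here, not from the original answer):

    // hello.cpp -- one source, two targets, built on an x86-64 Linux host:
    //   g++ -o hello-x86 hello.cpp                       (native build)
    //   arm-linux-gnueabihf-g++ -o hello-arm hello.cpp   (ARM cross build)
    #include <iostream>
    int main() { std::cout << "Hello from whatever CPU runs me\n"; }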
It's also quite critical to remember that the entire toolchain must often be matched to the target system, for instance when shared libraries such as libc come into play: the target OS could be a different release of Linux with different versions of common functions. In that case it's common to use toolchains that contain all the necessary libraries, and to use something like chroot or mock to compile in the "target environment" from within your own system.
So as I understand it, C++ code is comprised of assembly code, and when I compile a program it is read as its assembly equivalent and then run by the compiler. I also understand that assembly syntax and features change from model to model of processor. If this is so, how do compilers manage to compile programs without being littered with bugs? I mean, it can't be possible for a compiler to hold every assembly language variant ever created, can it?
I think you're confusing assembly code with machine code. They're not the same. Machine code is what the CPU executes: a byte stream of instructions and data. Assembly is a human-readable representation of machine code.
It's indeed true that all C++ code is eventually compiled into machine code. Yes, the instruction set changes between CPUs and CPU versions. Compilers have the notion of a "target architecture": when you compile, you have the option of specifying one; if you don't, the architecture of the current machine is usually assumed. Yes, compiler vendors have to expend effort to support every flavor of CPU they intend to support. Fortunately, there are not that many. Besides, in the C compilation process, code generation is not even the most complex step, so the majority of a compiler's own code is not CPU specific.
Some compilers work via assembly - rather than generating machine code, they generate assembly and feed that to an assembler for the final stage of compilation. With that kind of design, your compiler normally assumes a certain flavor of assembler to be present on the system - typically GNU assembler (as).
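With GCC, the target-selection idea is visible in flags like -march (these are real GCC options, shown here only to illustrate picking an architecture variant):

    // targets.cpp -- choosing a CPU variant within the same ISA:
    //   g++ -march=x86-64  -c targets.cpp   (baseline 64-bit x86)
    //   g++ -march=skylake -c targets.cpp   (may use Skylake-era instructions)
    int main() { return 0; }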
I think you've misunderstood the meaning of "assembly code".
C++ code does not "consist" of assembly code; it consists of, well, C++ code.
A compiler translates this C++ code, ultimately into executable machine code that can be run on a computer (usually under the direction of an operating system).
Assembly code is a human-readable symbolic representation of machine code. Typically a line of assembly code corresponds to a single CPU instruction of machine code. Assembly is a much lower-level language than C++ (or even C).
Some C++ compilers generate assembly code as an intermediate step; the assembly code is then translated into executable machine code. Other C++ compilers skip that step and generate machine code directly (though they may have an option to produce a human-readable assembly listing).
Typically each compiler accepts input in a single high-level language (C, C++, etc.) and generates output for one CPU (x86, ARM, MIPS, etc). Compilers are commonly designed in phases, so that the portion that processes the high-level input language can be combined with the portion that generates machine-specific code. gcc is designed this way. There are front ends that process a number of different input languages, and code generators that generate code for different CPUs. Thus if you already have an Ada front end and a MIPS back end, it's not too difficult to join them together to create an Ada compiler that generates MIPS machine code.
As for how compilers manage to do this without being "littered with bugs", well, it's just a lot of work.
I have a task to create optimized C++ source code and give it to a friend for compilation. That means I do not control the final compilation; I just write the source code of the C++ program.
I know that I can enable optimization during compilation with the -O1 (and -O2 and other) options of GCC. But how can I get this optimized source code instead of a compiled program? I am not able to configure the parameters of my friend's compiler, which is why I need to produce good source on my side.
The optimizations performed by GCC are low level; that means you won't get C++ code back, but assembly code at best, and you won't be able to convert that back into source.
In sum: optimize the source code at the code level, not at the object level.
You could ask GCC to dump its internal (GIMPLE, ...) representations at various stages. The middle end of GCC is made of hundreds of passes, which you can ask GCC to dump with arguments like -fdump-tree-all or -fdump-gimple-all; beware that you can get hundreds of dump files for a single compilation!
However, GCC internal representations are quite low level, and you should not expect to understand them without reading a lot of material.
The dump options I am mentioning are mostly useful to those working inside GCC, or extending it through plugins coded in C or extensions coded in MELT (a high-level domain-specific language for extending GCC). I am not sure they will be very useful to your friend. However, they can help you understand that optimization passes do a lot of complex processing.
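A concrete invocation (the flag is a real GCC option; the exact dump file names vary by GCC version):

    // tiny.cpp -- compile with internal-representation dumps enabled:
    //   g++ -O2 -fdump-tree-all -c tiny.cpp
    // This writes many files such as tiny.cpp.*.optimized next to tiny.o.
    int twice(int x) { return 2 * x; }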
And don't forget that premature optimization is evil: you should first make your program run correctly, then benchmark and profile it, and only then optimize the few parts worth your effort. You probably won't be able to write correct and efficient programs without testing and running them yourself before giving them to your friend.
Easy - choose the best algorithm possible, let the rest be handled by the optimizer.
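To make that concrete (a sketch, not a benchmark): counting distinct values with a hash set is O(n) on average, while a naive nested-loop version is O(n^2), a difference no optimizer flag can recover for you:

    #include <cstddef>
    #include <unordered_set>
    #include <vector>

    // O(n) on average: algorithm choice, not compiler flags, wins here.
    std::size_t count_distinct(const std::vector<int>& v) {
        std::unordered_set<int> seen(v.begin(), v.end());
        return seen.size();
    }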
Optimizing the source code is different from optimizing the binary. You optimize the source code; the compiler optimizes the binary.
For anything more than algorithm choice, you'll need to do some profiling. Sure, there are practices that can speed up code, but some make the code less readable. Only optimize when you have to, and only after you measure.