What is a symbol table? - c++

Can someone describe what a symbol table is within the context of C and C++?

There are two common and related meanings of "symbol table" here.
First, there's the symbol table in your object files. Usually, a C or C++ compiler compiles a single source file into an object file with a .obj or .o extension. This contains a collection of executable code and data that the linker can process into a working application or shared library. The object file has a data structure called a symbol table in it that maps the different items in the object file to names that the linker can understand. If you call a function from your code, the compiler doesn't put the final address of the routine in the object file. Instead, it puts a placeholder value into the code and adds a note that tells the linker to look up the reference in the various symbol tables from all the object files it's processing and stick the final location there.
Second, there's also the symbol table in a shared library or DLL. This is produced by the linker and serves to name all the functions and data items that are visible to users of the library. This allows the system to do run-time linking, resolving open references to those names to the location where the library is loaded in memory.
If you want to learn more, I suggest John Levine's excellent book "Linkers and Loaders".

Briefly, it is the mapping of the name you assign a variable to its address in memory, including metadata like type, scope, and size. It is used by the compiler.
That's in general, not just C[++]*. Technically, it doesn't always include a direct memory address. It depends on what language, platform, etc. the compiler is targeting.
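As a rough illustration of the "name to metadata" mapping described above, here is a minimal sketch. The SymbolInfo and SymbolTable names and their fields are hypothetical, chosen only for this example; real compilers store far more and organize it differently.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical record of what a compiler might remember about one name.
struct SymbolInfo {
    std::string type;     // e.g. "int"
    std::string scope;    // e.g. "global" or a function name
    std::size_t size;     // size in bytes
    bool has_address;     // false until storage has been assigned
    std::size_t address;  // meaningful only when has_address is true
};

// A flat symbol table: one map from name to metadata.
class SymbolTable {
public:
    // Returns false if the name was already declared.
    bool declare(const std::string& name, SymbolInfo info) {
        return table_.emplace(name, std::move(info)).second;
    }

    // Returns nullptr if the name is unknown.
    const SymbolInfo* lookup(const std::string& name) const {
        auto it = table_.find(name);
        return it == table_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::string, SymbolInfo> table_;
};
```

The has_address flag reflects the point that an entry need not carry a direct memory address at all times, or at all.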

On Linux, you can use the command:
nm [object file]
to list the symbol table of that object file. From this printout, you may then decipher the in-use linker symbols from their mangled names.

The symbol table is the list of "symbols" in a program/unit. Symbols are most often the names of variables or functions. The symbol table can be used to determine where in memory variables or functions will be located.

Check out the Symbol Table Wikipedia entry.

A symbol table is an important data structure created and maintained by compilers in order to store information about the occurrence of various entities such as variable names, function names, objects, classes, interfaces, etc.

From the "Computer Systems A Programmer’s Perspective" book, Ch 7 Linking. "Symbols and Symbol Tables":
Symbol table is information about functions and global variables that
are defined and referenced in the program
And an important note (from the same chapter):
It is important to realize that local linker symbols are not the same
as local program variables. The symbol table does not contain any
symbols that correspond to local nonstatic program variables. These
are managed at run time on the stack and are not of interest to the
linker
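To make the distinction in that quote concrete, here is a small sketch. All names are invented for illustration, and the nm letters in the comments describe what GCC on an ELF platform typically prints, which is an assumption about one toolchain, not a guarantee.

```cpp
// Namespace-scope definitions normally get entries in the object file's
// symbol table.
int global_counter = 0;      // global symbol (nm typically shows B or D)
static int file_local = 10;  // local linker symbol (lowercase letter in nm)

int bump() {                 // global function symbol (T in nm)
    // Static storage duration: lives in .data/.bss, not on the stack.
    // GCC typically emits a local linker symbol for it.
    static int calls = 0;

    // Automatic storage: lives on the stack or in a register and has
    // no entry in the linker's symbol table.
    int step = 1;

    calls += step;
    global_counter += file_local;
    return calls;
}
```

Compiling this with `g++ -c` and running nm on the result is a quick way to check which of these names survive into the object file on your own toolchain.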

Related

Will static variables with same name in two separate function conflict? [duplicate]

AFAIK, we can have two static variables with the same name in different functions? How are these managed by the compiler and symbol table? How are their identities managed separately?
Compilers don't store static variables' names in the linking symbol table. They are just some memory that is part of the module as far as the linker is concerned. (this may not be 100% true in all cases but it is effectively true)
The names of static variables are usually included within the debugging symbol table.
When you feed a .c file to the compiler it keeps up with the names of all known symbols so that it can recognize them for what they are when they come up in future code. It also remembers them so that it can give useful error/warning messages, but it pretty much forgets about them when generating output files (unless debugging symbols are being generated).
They are likely mangled in the table, in a similar way to how overloaded functions are implemented.
See dumpbin /symbols foo.obj if you want to peek at the table, or use objdump on Linux.
It depends on the compiler, but some embedded ones simply add a number to the end of each duplicate name. That way each variable has a unique name.
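A minimal sketch of the situation in the question, with hypothetical function names: two function-local statics share a source-level name but are distinct objects. With GCC on ELF, `nm --demangle` typically lists them as something like `next_even()::counter` and `next_odd()::counter`; other compilers may use a numeric suffix or omit the names entirely.

```cpp
int next_even() {
    static int counter = 0;   // belongs to next_even only
    counter += 2;
    return counter;
}

int next_odd() {
    static int counter = -1;  // a different object with the same name
    counter += 2;
    return counter;
}
```

Each function keeps its own running value, so calls to one never disturb the other.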

How can I easily generate a list of symbols with static storage?

We have a large C++ project that we're building with both GCC and MSVC, and we've encountered the static initialization order fiasco. Is there a way to generate a list of symbols that take part in static initialization so I can generate a plan to resolve the problem?
I've created a map file from both GCC and MSVC. MSVC's output didn't look very useful. It appears that GCC's map file could be used - I extracted everything related to the bss section. However, many of the symbols come from libraries and just add noise to the information.
Is there a trick or some other convenient way to obtain the information I'm looking for (short of reading every source file manually)?
For Visual C++: Sort the lines of the .map file. This will ensure the symbols are ordered by address.
Search for symbols __xc_a and __xc_z. The symbols that appear between those two symbols are all of the dynamic initializers for objects with static storage duration. The initializers will be executed in the order they appear in the list.
Each entry in the .map file includes both
the name of the global variable (e.g., the initializer for global variable fred will be fred$initializer$, plus the required C++ name decoration), and
the object file that contained the global variable (e.g. fred.obj). If the symbol came from a static library, the static library will be listed (e.g. libfred:fred.obj).
(I don't know enough about GCC to answer how to do this with their tools.)
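For reference, the kind of object that shows up in such a list is one whose initial value has to be computed at start-up. The sketch below uses the name fred from the answer above; compute_seed and barney are hypothetical. An optimizer is allowed to turn a dynamic initializer into a constant one when it can evaluate it at compile time, so the actual map file may contain fewer entries than the source suggests.

```cpp
// Hypothetical helper standing in for any run-time computation.
int compute_seed() { return 7; }

// Dynamic initialization: the compiler normally emits an initializer
// function that runs before main().
int fred = compute_seed();

// Constant initialization: the value is stored directly in the image,
// so no initializer function is needed.
int barney = 42;
```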

Satisfy external symbols with static library exports

I'm trying to hook GCC's __cxa_throw so I'll be able to receive exception backtraces without cluttering exception objects themselves. While it worked well with an executable, it doesn't work with shared objects.
I'm constructing a tiny static library which contains the hook itself and redirects it to my function in a separate SO. My __cxa_throw appears nicely as an exported symbol in the .text section. That doesn't help, though, even with LD's -Bsymbolic and -Bsymbolic-functions.
So the question is: can I force LD to find my existing implementation of __cxa_throw and satisfy the external symbol with it?
Thanks
First and foremost, your hook function must be compiled into an object file and linked into the resulting SO as an object file. It seems that symbols from object files have higher priority than even default imports like libstdc++. This should already satisfy all symbol references inside the whole SO, including archives being linked in.
Second, if you want your hook to not be visible from outside world, just use __attribute__((visibility("hidden")))
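A minimal sketch of how the attribute is applied, assuming GCC or Clang. The HOOK_HIDDEN macro and the record_backtrace_depth function are hypothetical stand-ins, not part of the actual hook.

```cpp
// The visibility attribute is a GCC/Clang extension, so guard it.
#if defined(__GNUC__)
#define HOOK_HIDDEN __attribute__((visibility("hidden")))
#else
#define HOOK_HIDDEN
#endif

// Hypothetical helper: callable from anywhere inside the shared object,
// but left out of its dynamic symbol table when built with GCC or Clang.
HOOK_HIDDEN int record_backtrace_depth(int depth) {
    return depth > 0 ? depth : 0;
}
```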

Does Symbol table for C++ code contain function names along with class names?

I have been searching through various posts regarding whether the symbol table for C++ code contains functions' names along with the class name. Something which I could find in a post is that it depends on the type of compiler:
if it compiles code in one-pass then it will not need to store class name and subroutine names in your symbol table
but if it is a multi-pass compiler, it could add information about the class(es) it encounters and their subroutines so that it could do argument type checking and issue meaningful error messages.
I could not understand whether it is actually compiler dependent or not? I was assuming that compiler(for C++ code) would put function names with class names in the table whether it is single pass or multi pass compiler. How is it dependent on the passes? I don't have such a great/deep knowledge.
Moreover, could anyone show a sample symbol table for a simple C++ class, how would it look like (function names with class name)?
Most compiler textbooks will tell you about symbol tables, and often show you details about a modest complexity language such as Pascal. You won't find information about C++ symbol tables in a textbook; it is too arcane.
We offer a complete C++14 front end for our DMS Software Reengineering Toolkit. It parses C++, builds detailed ASTs, and performs name-and-type resolution, which includes building a precise symbol table.
What follows are slides from our tutorial on how to use DMS, focused on the C++ symbol table structures.
OP asked specifically for a view of what happens with classes. The following diagram shows this for the tiny C++ program in the upper left corner. The rest of the diagram shows boxes, which represent what we call "symbol spaces" (or "scopes"), which are essentially hash tables mapping symbol names (each box lists the symbols it owns) to the information that DMS knows about that symbol (source file location of definition, list of AST nodes that reference the definition, and a complex union that represents the type, and that may in turn point to other types). The arrows show how symbol spaces are connected; an arrow from space A to space B means "scope A is contained within scope B". Typically the symbol space lookup process, searching scope A for a symbol x, will continue the search in scope B if x is not found in A. You'll note the arrows are numbered with an integer; this tells the search machinery to look in the least-numbered parent scope first, before trying to search scopes using arrows with larger numbers. This is how scopes are ordered (note that class C inherits from A and B; any lookup of a field in class C such as "b" will be forced to first look in the scope for A, and then in the scope for B). In this way, the C++ lookup rules are achieved.
Note that the class names are recorded in the (unique) global namespace because they are declared at top level. If they had been defined in some explicit namespace, then the namespace would have a corresponding symbol space of its own that recorded the declared classes, and the namespace itself would be recorded in the global symbol space.
OP did not ask what the symbol table looks like for function bodies, but I just so happen to have an illustrative slide for that, too, below.
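The lookup scheme described above (per-scope tables linked by numbered arrows) can be sketched roughly as follows. This is not DMS code: SymbolSpace, add_parent, and find_owner are hypothetical names, the search is a plain depth-first walk, and it ignores C++ details such as ambiguity detection.

```cpp
#include <map>
#include <string>

// Hypothetical "symbol space": the names one scope owns, plus numbered
// links to the scopes that should be searched next.
struct SymbolSpace {
    std::string label;
    std::map<std::string, std::string> symbols;      // name -> description
    std::multimap<int, const SymbolSpace*> parents;  // arrow number -> scope

    void add_parent(int arrow_number, const SymbolSpace* parent) {
        parents.insert({arrow_number, parent});
    }

    // Returns the space that owns `name`, or nullptr if no reachable
    // space declares it. Lower-numbered arrows are followed first
    // because std::multimap iterates in key order.
    const SymbolSpace* find_owner(const std::string& name) const {
        if (symbols.count(name) != 0) return this;
        for (const auto& link : parents) {
            if (const SymbolSpace* owner = link.second->find_owner(name)) {
                return owner;
            }
        }
        return nullptr;
    }
};
```

For a class C that inherits from A and then B, the space for C would get the space for A as arrow 1 and the space for B as arrow 2, so a lookup of "b" tries A before B.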
The symbol spaces work the same way. What is shown in this slide is the linkage between a symbol space, and the scoped region it represents. That linkage is actually implemented by a pointer associated with the symbol space, to the corresponding AST(s, namespace definitions can be scattered around in multiple places).
Note that in this case, the function name is recorded in the global namespace because it is declared at top level. If it had been defined inside the scope of a class, the function name would have been recorded in the symbol space for the class body (on previous diagram).
As a general rule, the details of how the symbol table is organized is completely dependent on the compiler, and the choices the designers made. In our case, we designed a very general symbol table management package because we planned (and have) used the same package to handle multiple languages (C, C++, Java, COBOL, several legacy languages) in a uniform way.
However, the abstract structures of symbol spaces and inheritance will have to be implemented in essentially equivalent ways across C++ compilers; after all, they have to model the same information. I'd expect similar structures in the GCC and Clang compilers (well, the integer-numbered inheritance arcs, maybe not :)
As a practical matter, it doesn't matter how many "passes" your compiler has. It pretty much has to build these structures to remember what it knows about the symbols, within a pass, and across passes.
While building a C++ parser is very hard by itself, building such a symbol table is much harder. The effort dwarfs the effort to build the C++ parser. Our C++ name resolver is some 250K SLOC of attribute-grammar code compiled and executed by DMS. Getting the details right is an enormous headache; the C++ reference manual is enormous and confusing, the facts are scattered everywhere across the document, and in a variety of places it is contradictory (we try to send complaints about this to the committee) and/or inconsistent between compilers (we have versions for GCC and Visual Studio 201x).
Update March 2017: Now have symbol tables for C++2014.
Update June 2018: Now have symbol tables for C++2017.
A symbol table maps names to constructs within the program. As such it is used to record the names of classes, functions, variables, and anything else that has a user-specified name within the program.
(There are two common kinds of symbol table - one that the compiler maintains when it is compiling your program, and another that exists in object file so that it can be linked to other objects. The two are strongly related, but need not have similar representation internally. Typically only some of the symbols from the compiler's symbol table will be output into the object).
Part of what you say makes no sense:
if it compiles code in one-pass then it will not need to store class name and subroutine names in your symbol table
How can the compiler determine to what construct a name refers if it cannot look it up in the symbol table?
but if it is a multi-pass compiler, it could add information about the class(es) it encounters and their subroutines so that it could do argument type checking and issue meaningful error messages.
There's no reason it could not do this in a single pass.
I could not understand whether it is actually compiler dependent or not?
All compilers are going to use a symbol table, but its use will be hidden inside the implementation.
I was assuming that compiler(for C++ code) would put function names with class names in the table whether it is single pass or multi pass compiler. How is it dependent on the passes?
How is what dependent on the passes? All names go in the symbol table - that's what it's for - and usually symbol resolution is important for just about everything else the compiler does, so it needs to be done early (i.e. in the first pass - and in fact the main purpose of the first pass in a multi-pass compiler may well be just to build the symbol table!).
Moreover, could anyone show a sample symbol table for a simple C++ class, how would it look like (function names with class name)?
I'll give it a stab:
class A
{
int a;
void f(int, int);
};
Will yield a symbol table containing symbols "A", "a", and "f". Typically "a" and "f" would be marked with a scope to simplify lookup, eg:
"A" -> (class)
"A::a" -> (class variable member)
"A::f(int,int)" -> (class function member)
It's also possible that the a and f symbols will not be stored in the top-level symbol table, but rather that each name space (including C++ namespaces and classes) will have its own symbol table, containing the symbols defined inside it. But this is, arguably, just a data structure choice. You can still abstractly view the symbol table as a flat table, where a name maps to a construct.
In general the "A::a" symbol would not be output to the object file, since it is not required for linking.
Short answer: yes, using 'nm --demangle' on Linux.
Long answer: Function entries in the symbol table encode the function name plus its parameter types (some compilers also encode the return type) and, if the function belongs to a class, the class name too. To save space, the names, types (not always), and classes are not written out in full; they are encoded in a compact form. This encoded string is called a mangled name. The mangled name is still unique, and the full class name can be parsed back out of it. To view the symbol table of your program, you can use 'nm' on Linux.
http://linux.about.com/library/cmd/blcmdl1_nm.htm
nm has the --demangle flag to show the original names. You can compile short programs to see what comes out.
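The same decoding that `nm --demangle` performs is also available programmatically. The sketch below assumes a GCC or Clang toolchain, whose C++ runtime provides abi::__cxa_demangle in <cxxabi.h>; the demangle wrapper itself is a hypothetical helper.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC and Clang runtimes)
#include <cstdlib>   // std::free
#include <string>

// Returns the demangled form of an Itanium-ABI mangled name, or the
// input unchanged if demangling fails.
std::string demangle(const std::string& mangled) {
    int status = 0;
    // With a null output buffer, the result is allocated with malloc
    // and must be released with free.
    char* text = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) return mangled;
    std::string result(text);
    std::free(text);
    return result;
}
```

For instance, `demangle("_ZN1A1fEii")` yields `A::f(int, int)`, which is the member function f of class A taking two ints.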

How does linkage and name mangling work?

Let's take this code sample
//header
struct A { };
struct B { };
struct C { };
extern C c;
//code
A myfunc(B&b){ A a; return a; }
void myfunc(B&b, C&c){}
C c;
Let's do this line by line, starting from the code section.
When the compiler sees the first myfunc function, it does not care about A or B because their use is internal. Each C++ file will know what it takes in and what it returns. But there needs to be a name for each of the two overloads, so how is that chosen, and how does the linker know which is which?
Next is C c; I once had a bug where the linker wouldn't recognize c and so wouldn't allow me access to it from other C++ files. It was because that .cpp file didn't know c was extern, and I had to mark it as extern in the header before I could link successfully. Now I am not sure whether the class type has any involvement with the linker and the variable c. I don't know how RTTI will be involved, but I do know c needs to be visible to other files.
How do the linker, name mangling, and so on work?
We first need to understand where compilation ends and linking begins. Compilation involves taking a compilation unit (a C or C++ source file) and turning it into an object file. Simplistically, this involves generating snippets of machine code for each function as well as a symbol table for all functions and static (global) variables. Placeholders are used for any symbols needed by the compilation unit that are external to the said compilation unit.
The linker is then responsible for loading all the object files and resolving all the placeholder symbols with real addresses (or offsets for position-independent code). The result is placed into various sections that can be read by the operating system's dynamic loader when loading an executable.
So, for the specifics: in order to avoid errors during linking, the compiler requires you to declare all external symbols that will be used by the current compilation unit. For global variables one must use the extern keyword; for functions it is optional.
All functions and global variables defined in a compilation unit have external linkage (i.e., can be referenced by other compilation units) unless declared with the static keyword (or placed in an unnamed namespace in C++). In C++, the vtable will also have a symbol needed for linkage.
Now in C++, since functions can be overloaded, the parameters also form part of the function name. Since machine-code is just addresses and registers, extra information needs to be added to the function name in the symbols table. This extra parameter information comes in the form of a mangled name and ensures that the linker links to the correct version of an overloaded function.
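Applied to the code from the question, this looks roughly as follows. The mangled names in the comments are what the Itanium C++ ABI (used by GCC and Clang) typically produces; MSVC uses a different scheme, and some platforms add a leading underscore. The plain_c_name function is a hypothetical addition that shows how extern "C" switches mangling off.

```cpp
struct A { };
struct B { };
struct C { };

// Two overloads: same source name, different parameter lists.
// Typical Itanium-ABI symbols: _Z6myfuncR1B and _Z6myfuncR1BR1C.
// The parameter types are encoded; for ordinary functions like these
// the return type is not.
A myfunc(B& b) { (void)b; return A{}; }
void myfunc(B& b, C& c) { (void)b; (void)c; }

// extern "C" disables mangling, so the symbol is plain "plain_c_name";
// as a consequence such a function cannot be overloaded.
extern "C" int plain_c_name(int x) { return x + 1; }

// A namespace-scope variable; under the Itanium ABI its symbol is
// typically just "c", so the class type is not part of the name.
C c;
```

Because the two myfunc symbols differ, the linker never has to know anything about overloading; it just matches names.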
If you really are interested in the gory details take a look at the ELF file format (PDF) used extensively on Linux. Windows has a different format but the principles can be expected to be the same.
Name mangling on the Itanium (and ARM) platforms can be found here.
http://en.wikipedia.org/wiki/Name_mangling