LLVM: Get function argument locations (ABI)

LLVM: Get function argument locations (ABI) - llvm

I'd like to write a LLVM plugin to tell me: for each function declaration, how many general purpose register, XMM register, and stack arguments the function will have. This only needs to work for the x86 SYSV calling convention. Is this currently possible? If so, any direction would be most appreciated.

It's possible but slightly tricky, LLVM seems to be much more friendly when you attempt to manipulate IR than when messing with Machine specific code. There's no easy way to add passes, for one.
Also note that the eventual number of registers used depends on how virtual registers are eventually allocated into real x86 logical register, where spilling may occur in some cases, and elimination of some variables in others. So you're not only prevented to work on IR where values are in SSA form, you actually need to work on the very last stages of CodeGen. To make things more interesting, it also depends on the level of optimization you use.
Check out - llvm/lib/CodeGen/Passes.cpp, look at addOptimizedRegAlloc or addFastRegAlloc, check out the passed called inside and use the last ones (implemented in separate files in CodeGen), to hitch a ride on. The internal loops should be pretty straight forward, but you'll need to run your bookkeeping based on the X86 target regs enumerated in TargetRegisterInfo

Related

Will compiler automatically optimize repeating code?

If I have some code with simple arithmetics that is repeating several times. Will the compiler automatically optimize it?
Here the example:
someArray[index + 1] = 5;
otherArray[index + 1] = 7;
Does it make sense to introduce variable nextIndex = index + 1 from the perfomance point of view, (not from the point of view of good readable and maintanable code) or the compiler will do such optimization automatically?

You should not worry about trivial optimization like this because almost all compilers do it last 10-15 years or longer.
But if you have a really critical place in your code and want to get maximal speed of running, than you can check generated assembler code for this lines to be sure that compiler did this trivial optimization.
In some cases one more arithmetic addition could be more faster version of code than saving in register or memory, and compilers knows about this. You can make your code slower if you try optimize trivial cases manually.
And you can use online services like https://gcc.godbolt.org for check generated code (support gcc, clang, icc in several version).

The old adage "suck it and see" seems to be appropriate here. We often forget that by far the most common processors are 4/8/16 bit micros with weird and wonderful application specific architectures and suitably odd vendor specific compilers to go with them. They frequently have compiler extensions to "aid" (or confuse) the compiler into producing "better" code.
One DSP from early 2000s carried out 8 instructions per clock-cycle in parallel in a pipeline (complex - "load+increment+multiply+add+round"). The precondition for this to work was that everything had to be preloaded into the registers beforehand. This meant that registers were obviously at a premium (as always). With this architecture it was frequently better to bin results to free registers and use free slots that couldn't be paralleled (some instructions precluded the use of others in the same cycle) to recalculate it later. Did the compiler get this "right"?. Yes, it often kept the result to reuse later with the result that it stalled the pipeline due to lack of registers which resulted in slower execution speed.
So, you compiled it, examined it, profiled it etc. to make sure that the when when the compiler got it "right" we could go in and fix it. Without additional semantic information which is not supported by the language it is really hard to know what "right" is.
Conclusion: Suck it and see

Yes. It's a common optimization. https://en.wikipedia.org/wiki/Common_subexpression_elimination

When should you not use [[carries_dependency]]?

I've found questions (like this one) asking what [[carries_dependency]] does, and that's not what I'm asking here.
I want to know when you shouldn't use it, because the answers I've read all make it sound like you can plaster this code everywhere and magically you'd get equal or faster code. One comment said the code can be equal or slower, but the poster didn't elaborate.
I imagine appropriate places to use this is on any function return or parameter that is a pointer or reference and that will be passed or returned within the calling thread, and it shouldn't be used on callbacks or thread entry points.
Can someone comment on my understanding and elaborate on the subject in general, of when and when not to use it?
EDIT: I know there's this tome on the subject, should any other reader be interested; it may contain my answer, but I haven't had the chance to read through it yet.

In modern C++ you should generally not use std::memory_order_consume or [[carries_dependency]] at all. They're essentially deprecated while the committee comes up with a better mechanism that compilers can practically implement.
And that hopefully doesn't require sprinkling [[carries_dependency]] and kill_dependency all over the place.
2016-06 P0371R1: Temporarily discourage memory_order_consume
It is widely accepted that the current definition of memory_order_consume in the standard is not useful. All current compilers essentially map it to memory_order_acquire. The difficulties appear to stem both from the high implementation complexity, from the fact that the current definition uses a fairly general definition of "dependency", thus requiring frequent and inconvenient use of the kill_dependency call, and from the frequent need for [[carries_dependency]] annotations. Details can be found in e.g. P0098R0.
Notably that in C++ x - x still carries a dependency but most compilers would naturally break the dependency and replace that expression with a constant 0. But also compilers do sometimes turn data dependencies into control dependencies if they can prove something about value-ranges after a branch.
On modern compilers that just promote mo_consume to mo_acquire, fully aggressive optimizations can always happen; there's never anything to gain from [[carries_dependency]] and kill_dependency even in code that uses mo_consume, let alone in other code.
This strengthening to mo_acquire has potentially-significant performance cost (an extra barrier) for real use-cases like RCU on weakly-ordered ISAs like POWER and ARM. See this video of Paul E. McKenney's CppCon 2015 talk C++ Atomics: The Sad Story of memory_order_consume. (Link includes a summary).
If you want real dependency-ordering read-only performance, you have to "roll your own", e.g. by using mo_relaxed and checking the asm to verify it compiled to asm with a dependency. (Avoid doing anything "weird" with such a value, like passing it across functions.) DEC Alpha is basically dead and all other ISAs provide dependency ordering in asm without barriers, as long as the asm itself has a data dependency.
If you don't want to roll your own and live dangerously, it might not hurt to keep using mo_consume in "simple" use-cases where it should be able to work; perhaps some future mo_consume implementation will have the same name and work in a way that's compatible with C++11.
There is ongoing work on making a new consume, e.g. 2018's http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0750r1.html

because the answers I've read all make it sound like you can plaster
this code everywhere and magically you'd get equal or faster code
The only way you can get faster code is when that annotation allows the omission of a fence.
So the only case where it could possibly be useful is:
your program uses consume ordering on an atomic load operation, in an important frequently executed code;
the "consume value" isn't just used immediately and locally, but also passed to other functions;
the target CPU gives specific guarantees for consuming operations (as strong as a given fence before that operation, just for that operation);
the compiler writers take their job seriously: they manage to translate high level language consuming of a value to CPU level consuming, to get the benefit from CPU guarantees.
That's a bunch of necessary conditions to possibly get measurably faster code.
(And the latest trend in the C++ community is to give up inventing a proper compiling scheme that's safe in all cases and to come up with a completely different way for the user to instruct the compiler to produce code that "consumes" values, with much more explicit, naively translatable, C++ code.)
One comment said the code can be equal or slower, but the poster
didn't elaborate.
Of course annotations of the kind that you can randomly put on programs simply cannot make code more efficient in general! That would be too easy and also self contradictory.
Either some annotation specifies a constrain on your code, that is a promise to the compiler, and you can't put it anywhere where it doesn't correspond an guarantee in the code (like noexcept in C++, restrict in C), or it would break code in various ways (an exception in a noexcept function stops the program, aliasing of restricted pointers can cause funny miscompilation and bad behavior (formerly the behavior is not defined in that case); then the compiler can use it to optimize the code in specific ways.
Or that annotation doesn't constrain the code in any way, and the compiler can't count on anything and the annotation does not create any more optimization opportunity.
If you get more efficient code in some cases at no cost of breaking program with an annotation then you must potentially get less efficient code in other cases. That's true in general and specifically true with consume semantic, which imposes the previously described constrained on translation of C++ constructs.
I imagine appropriate places to use this is on any function return or
parameter that is a pointer or reference and that will be passed or
returned within the calling thread
No, the one and only case where it might be useful is when the intended calling function will probably use consume memory order.

How do I find how C++ compiler implements something except inspecting emitted machine code?

Suppose I crafted a set of classes to abstract something and now I worry whether my C++ compiler will be able to peel off those wrappings and emit really clean, concise and fast code. How do I find out what the compiler decided to do?
The only way I know is to inspect the disassembly. This works well for simple code, but there're two drawbacks - the compiler might do it different when it compiles the same code again and also machine code analysis is not trivial, so it takes effort.
How else can I find how the compiler decided to implement what I coded in C++?

I'm afraid you're out of luck on this one. You're trying to find out "what the compiler did". What the compiler did is to produce machine code. The disassembly is simply a more readable form of the machine code, but it can't add information that isn't there. You can't figure out how a meat grinder works by looking at a hamburger.

I was actually wondering about that.
I have been quite interested, for the last few months, in the Clang project.
One of Clang particular interests, wrt optimization, is that you can emit the optimized LLVM IR code instead of machine code. The IR is a high-level assembly language, with the notion of structure and type.
Most of the optimizations passes in the Clang compiler suite are indeed performed on the IR (the last round is of course architecture specific and performed by the backend depending on the available operations), this means that you could actually see, right in the IR, if the object creation (as in your linked question) was optimized out or not.
I know it is still assembly (though of higher level), but it does seem more readable to me:
far less opcodes
typed objects / pointers
no "register" things or "magic" knowledge required
Would that suit you :) ?

Timing the code will directly measure its speed and can avoid looking at the disassembly entirely. This will detect when compiler, code modifications or subtle configuration changes have affected the performance (either for better or worse). In that way it's better than the disassembly which is only an indirect measure.
Things like code size can also serve as possible indicators of problems. At the very least they suggest that something has changed. It can also point out unexpected code bloat when the compiler should have boiled down a bunch of templates (or whatever) into a concise series of instructions.
Of course, looking at the disassembly is an excellent technique for developing the code and helping decide if the compiler is doing a sufficiently good translation. You can see if you're getting your money's worth, as it were.
In other words, measure what you expect and then dive in if you think the compiler is "cheating" you.

You want to know if the compiler produced "clean, concise and fast code".
"Clean" has little meaning here. Clean code is code which promotes readability and maintainability -- by human beings. Thus, this property relates to what the programmer sees, i.e. the source code. There is no notion of cleanliness for binary code produced by a compiler that will be looked at by the CPU only. If you wrote a nice set of classes to abstract your problem, then your code is as clean as it can get.
"Concise code" has two meanings. For source code, this is about saving the scarce programmer eye and brain resources, but, as I pointed out above, this does not apply to compiler output, since there is no human involved at that point. The other meaning is about code which is compact, thus having lower storage cost. This can have an impact on execution speed, because RAM is slow, and thus you really want the innermost loops of your code to fit in the CPU level 1 cache. The size of the functions produced by the compiler can be obtained with some developer tools; on systems which use GNU binutils, you can use the size command to get the total code and data sizes in an object file (a compiled .o), and objdump to get more information. In particular, objdump -x will give the size of each individual function.
"Fast" is something to be measured. If you want to know whether your code is fast or not, then benchmark it. If the code turns out to be too slow for your problem at hand (this does not happen often) and you have some compelling theoretical reason to believe that the hardware could do much better (e.g. because you estimated the number of involved operations, delved into the CPU manuals, and mastered all the memory bandwidth and cache issues), then (and only then) is it time to have a look at what the compiler did with your code. Barring these conditions, cleanliness of source code is a much more important issue.
All that being said, it can help quite a lot if you have a priori notions of what a compiler can do. This requires some training. I suggest that you have a look at the classic dragon book; but otherwise you will have to spend some time compiling some example code and looking at the assembly output. C++ is not the easiest language for that, you may want to begin with plain C. Ideally, once you know enough to be able to write your own compiler, then you know what a compiler can do, and you can guess what it will do on a given code.

You might find a compiler that had an option to dump a post-optimisation AST/representation - how readable it would be is another matter. If you're using GCC, there's a chance it wouldn't be too hard, and that someone might have already done it - GCCXML does something vaguely similar. Of little use if the compiler you want to build your production code on can't do it.
After that, some compiler (e.g. gcc with -S) can output assembly language, which might be usefully clearer than reading a disassembly: for example, some compilers alternate high-level source as comments then corresponding assembly.
As for the drawbacks you mentioned:
the compiler might do it different when it compiles the same code again
absolutely, only the compiler docs and/or source code can tell you the chance of that, though you can put some performance checks into nightly test runs so you'll get alerted if performance suddenly changes
and also machine code analysis is not trivial, so it takes effort.
Which raises the question: what would be better. I can image some process where you run the compiler over your code and it records when variables are cached in registers at points of use, which function calls are inlined, even the maximum number of CPU cycles an instruction might take (where knowable at compile time) etc. and produces some record thereof, then a source viewer/editor that colour codes and annotates the source correspondingly. Is that the kind of thing you have in mind? Would it be useful? Perhaps some more than others - e.g. black-and-white info on register usage ignores the utility of the various levels of CPU cache (and utilisation at run-time); the compiler probably doesn't even try to model that anyway. Knowing where inlining was really being done would give me a warm fuzzy feeling. But, profiling seems more promising and useful generally. I fear the benefits are more intuitively real than actually, and compiler writers are better off pursuing C++0x features, run-time instrumentation, introspection, or writing D "on the side" ;-).

The answer to your question was pretty much nailed by Karl. If you want to see what the compiler did, you have to start going through the assembly code it produced--elbow grease is required. As to discovering the "why" behind the "how" of how it implemented your code...every compiler (and every build, potentially), as you mentioned, is different. There are different approaches, different optimizations, etc. However, I wouldn't worry about whether it's emitting clean, concise machine code--cleanliness and concision should be left to the source code. Speed, on the other hand, is pretty much the programmer's responsibility (profiling ftw). More interesting concerns are correctness, maintainability, readability, etc. If you want to see if it made a specific optimization, the compiler docs might help (if they're available for your compiler). You can also just trying searching to see if the compiler implements a known technique for optimizing whatever. If those approaches fail, though, you're right back to reading assembly code. Keep in mind that the code that you're checking out might have little to no impact on performance or executable size--grab some hard data before diving into any of this stuff.

Actually, there is a way to get what you want, if you can get your compiler to
produce DWARF debugging information. There will be a DWARF description for each
out-of-line function and within that description there will (hopefully) be entries
for each inlined function. It's not trivial to read DWARF, and sometimes compilers
don't produce complete or accurate DWARF, but it can be a useful source of information
about what the compiler actually did, that's not tied to any one compiler or CPU.
Once you have a DWARF reading library there are all sorts of useful tools you can
build around it.
Don't expect to use it with Visual C++ as that uses a different debugging format.
(But you might be able to do similar queries through the debug helper library
that comes with it.)

If your compiler manages to translate your "wrappings and emit really clean, concise and fast code" the effort to follow-up the emitted code should be reasonable.
Contrary to another answer I feel that emitted assembly code may well be "clean" if it is (relatively) easily mappable to the original source code, if it doesn't consist of calls all over the place and that the system of jumps is not too complex. With code scheduling and re-ordering an optimized machine code which is also readable is, alas, a thing of the past.

Do global variables mean faster code?

I read recently, in an article on game programming written in 1996, that using global variables is faster than passing parameters.
Was this ever true, and if so, is this still true today?

Short answer - No, good programmers make code go faster by knowing and using the appropriate tools for the job, and then optimizing in a methodical way where their code does not meet their requirements.
Longer answer - This article, which in my opinion is not especially well-written, is not in any case general advice on program speedup but '15 ways to do faster blits'. Extrapolating this to the general case is missing the writer's point, whatever you think of the merits of the article.
If I was looking for performance advice, I would place zero credence in an article that does not identify or show a single concrete code change to support the assertions in the sample code, and without suggesting that measuring the code might be a good idea. If you are not going to show how to make the code better, why include it?
Some of the advice is years out of date - FAR pointers stopped being an issue on the PC a long time ago.
A serious game developer (or any other professional programmer, for that matter) would have a good laugh about advice like this:
You can either take out the assert's
completely, or you can just add a
#define NDEBUG when you compile the final version.
My advice to you, if you really wish to evaluate the merit of any of these 15 tips, and since the article is 14 years old, would be to compile the code in a modern compiler (Visual C++ 10 say) and try to identify any area where using a global variable (or any of the other tips) would make it faster.
[Just joking - my real advice would be to ignore this article completely and ask specific performance questions on Stack Overflow as you hit issues in your work that you cannot resolve. That way the answers you get will be peer reviewed, supported by example code or good external evidence, and current.]

When you switch from parameters to global variables, one of three things can happen:
it runs faster
it runs the same
it runs slower
You will have to measure performance to see what's faster in a non-trivial concrete case. This was true in 1996, is true today and is true tomorrow.
Leaving the performance aside for a moment, global variables in a large project introduce dependencies which almost always make maintenance and testing much harder.
When trying to find legitimate uses of globals variables for performance reasons today I very much agree with the examples in Preet's answer: very often needed variables in microcontroller programs or device drivers. The extreme case is a processor register which is exclusively dedicated to the global variable.
When reasoning about the performance of global variables versus parameter passing, the way the compiler implements them is relevant. Global variables typically are stored at fixed locations. Sometimes the compiler generates direct addressing to access the globals. Sometimes however, the compiler uses one more indirection and uses a kind of symbol table for globals. IIRC gcc for AIX did this 15 years ago. In this environment, globals of small types were always slower than locals and parameter passing.
On the other hand, a compiler can pass parameters by pushing them on the stack, by passing them in registers or a mixture of both.

Everyone has already given the appropriate caveat answers about this being platform and program specific, needing to actually measure timings, etc. So, with that all said already, let me answer your question directly for the specific case of game programming on x86 and PowerPC.
In 1996, there were certain cases where pushing parameters onto the stack took extra instructions and could cause a brief stall inside the Intel CPU pipeline. In those cases there could be a very small speedup from avoiding parameter passing altogether and reading data from literal addresses.
This isn't true any more on the x86 or on the PowerPC used in most game consoles. Using globals is usually slower than passing parameters for two reasons:
Parameter passing is implemented better now. Modern CPUs pass their parameters in registers, so reading a value from a function's parameter list is faster than a memory load operation. The x86 uses register shadowing and store forwarding, so what looks like shuffling data onto the stack and back can actually be a simple register move.
Data cache latency far outweighs CPU clock speed in most performance considerations. The stack, being heavily used, is almost always in cache. Loading from an arbitrary global address can cause a cache miss, which is a huge penalty as the memory controller has to go and fetch the data from main RAM. ("Huge" here is 600 cycles or more.)

What do you mean, "faster"?
I know for a fact, that understanding a program with global variables takes me a whole lot more time than one without.
If the extra time it takes the programmer(s) is less than the time gained by the users when they run the program with globals, then I'd say using global is faster.
But consider that the program is going to be run by 10 people once a day for 2 years. And that it takes 2.84632 secs without globals and 2.84217 secs with globals (a 0.00415 sec increase). That's 727 seconds less of TOTAL runtime. Gaining 10 minutes of run time is not worth the introduction of a global as regards programmer time.

To a degree any code that avoids processor instructions (ie shorter code) will be faster. However how much faster? Not very! Also note that compiler optimisation strategies may result in the smaller code anyway.
These days this is only an optimisation on very specific applications usually in ultra time critical drivers or micro-control code.

Putting aside the issues of maintainability and correctness, there are basically two factors that will govern performance with regard to globals vs. parameters.
When you make a global you avoid a copy. That's slightly faster. When you pass a parameter by value, it has to be copied so that a function can work on a local copy of it and not damage the caller's copy of the data. At least in theory. Some modern optimizers do pretty tricky things if they identify that your code is well behaved. A function may get automatically inlined, and the compiler may notice that the function doesn't do anything to the parameters, and just optimise away any copying.
When you make a global, you are lying to the cache. When you have all of your variables neatly contained in your function, and a few parameters, the data will tend to all be in one place. Some of the variables will be in registers, and some will probably be in cache right away because they are right 'next to' each other. Using a lot of global variables is basically pathological behavior for the cache. There is no guarantee that various globals will be used by the same functions. Location has no obvious correlation with usage. Perhaps you have a small enough working set that it makes no difference where anything is, and it all winds up in cache.
All of this just adds up to the point made by a poster above me:
When you switch from parameters to
global variables, one of three things
can happen:
* it runs faster
* it runs the same
* it runs slower
You will have to measure performance
to see what's faster in a non-trivial
concrete case. This was true in 1996,
is true today and is true tomorrow.
Depending on the specific behavior of your exact compiler, and precise details of the hardware that you use to run your code, it's possible that global variables could be a very slight performance win in some cases. That possibility may be worth trying it on some code that runs too slow as an experiment. It's probably not worth dedicating yourself to, as the answer of your experiment could change tomorrow. So, the right answer is almost always to go with "correct" design patterns and avoid the uglier design. Look for better algorithms, more efficient data structures, etc., before intentionally trying to spaghettify your project. Much better payoff in the long run.
And, aside from the dev time vs user time argument, I'll add the dev time vs. Moore's time argument. If you assume Moore's law will make computers something like half again as fast every year, then for the sake of a simple round number, we can assume that progress happens in a steady 1% progress per week. IF you are looking at a microoptimisation that may improve things like 1%, and it will add a week to the project from complicating things, then just taking the week off will have the same effect on average run times for your users.

Perhaps a micro optimisation, and would probably be wiped out by optimisations your compiler could generate without resort to such practices. In fact the use of globals may even inhibit some compiler optimisations. Reliable and maintainable code would generally be of greater value, and globals are not conducive to that.
Using globals to replace function parameters renders all such functions non-reentrant, which may be a problem if multi-threading is used - not a common practice in game development in 1996, but more common with the advent of multi-core processors. It also precludes recursion, although that is probably less of an issue since recursion has its own issues.
In any significant body of code, there is likely to be more mileage in higher-level optimisation of algorithms and data structures. Moreover there are options open to you other than global variables that avoid parameter passing, most especially C++ class-member variables.
If the habitual use of global variables in your code makes a measurable or useful difference to its performance, I would question the design first.
For a discussion of the problems inherent in global variables and some ways to avoid them see A Pox on Globals by Jack Gannsle. The article relates to embedded systems development, but is generally applicable; its just that some embedded systems developers think they have good reason to use globals, probably for all the same misguided reasons used to justify it in game development.

Well, if you are considering using global parameters instead of parameter passing, that could mean that you have a long chain of methods/functions that you have to pass that parameter down. It that is the case, you really WILL save CPU cycles by switching from parameter to global variable.
So, guys that say that it depends, I guess that they are plain wrong. Even with REGISTER parameter passing, there will still be MORE cpu cycles and MORE overhead for pushing the parameters down to the callee.
HOWEVER - I never do that. CPUs are superior now, and at times when there were 12Mhz 8086s that could be the issue. Nowadays, if you don't write embedded or super-turbo-charged performance code, stick to that which looks good in code, which doesn't break code logic, and thrives to be modular.
And lastly, leave machine language code generation to compiler - guys who designed it are best at knowing how their baby performs and will make your code run at its best.

In general (but it may depend greatly on compiler and platform implementation), passing parameters mean writing them onto the stack which you would not need with global variable.
That said, global variable may mean include page refresh in the MMU or memory controller whereas the stack may be located in much faster memory available to the processor...
Sorry, no good answer for a general question like this, just measure it (and try different scenarios too)

It was faster when we had <100mhz processors. Now that that processors are 100x faster this 'problem' is 100x less significant. It wasnt a big deal then, it was a big deal when you did it in assembly and had no (good) optimizer.
Says the guy who programmed on a 3mhz processor. Yes you read that right and 64k was NOT enough.

I see a lot of theoretical answers, but no practical advice for your scenario. What I'm guessing is that you have a large number of parameters to pass down through a number of function calls, and you're worried about accumulated overhead from many levels of call frames and many parameters at each level. Otherwise your concern is completely unfounded.
If this is your scenario, you should probably put all of the parameters in a "context" structure and pass a pointer to that structure. This will ensure data locality, and makes it so you don't have to pass more than one argument (the pointer) at each function call.
Parameters accessed this way are slightly more expensive to access than true function arguments (you need an extra register to hold the pointer to the base of the structure, as opposed to the frame pointer which would serve this purpose with function arguments), and individually (but probably not with cache effects factored in) more expensive to access than global variables in normal, non-PIC code. However, if your code is in a shared library/DLL using position independent code, the cost of accessing parameters passed by pointer to struct is cheaper than accessing a global variable and identical to accessing static variables, due to GOT and GOT-relative addressing. This is another reason never to use global variables for parameter passing: if you may eventually put your code in a shared library/DLL, any possible performance benefits will suddenly backfire!

Like everything else: yes and no. There is no one answer because it depends on context.
Counterpoints:
Imagine programming on Itanium where you have hundreds of registers. You can put quite a few globals into those, which will be faster than the typical way globals are implemented in C (some static address (although they might just hardcode the globals into instructions if they are word length)). Even if the globals are in cache the whole time, registers may still be faster.
In Java, overuse of globals (statics) can decrease performance because of initialization locks that have to be done. If 10 classes want to access some static class, they all have to wait for that class to finish initializing its static fields, which can take anywhere form no time up to forever.
In any case, global state is just bad practice, it raises code complexity. Well designed code is naturally fast enough 99.9% of the time. It seems like newer languages are removing global state all together. E removes global state because it violates their security model. Haskell removes state all together. The fact that Haskell exists and has implementations that outperform most other languages is proof enough for me that I will never use globals again.
Also, in the near future, when we all have hundreds of cores, global state isn't really going to help much.

It might still be true, under some circumstances.
A global variable might be as fast as a pointer to a variable, where its pointer is stored in/passed through registers only. So, it is a question about the count of registers, you can use.
To speed-optimize a function call, you could do several other things, that might perform better with global-variable-hacks:
Minimize the count of local variables in the function to a few (explicit) register variables.
Minimize the count of parameters of the function, i.e. by using pointers to structures instead of using the same parameter-constellations in functions that call each other.
Make the function "naked", that means that it does not use the stack at all.
Use "proper-tail-calls" (does neither work with java/-bytecode nor java-/ecma-script)
If there is no better way, hack yourself sth like TABLES_NEXT_TO_CODE, which locates your global variables next to the function code. In functional languages this is a backend-optimization that uses the function-pointer as data-pointer, too; but as long as you do not program in a functional language, you only need to locate those variables beside those used by the function. Then again, you only want this to remove the stack-handling from your function. If your compiler generates assembler code that handles the stack, then there is no point in doing this, you could use pointers instead.
I've found this "gcc attribute overview":
http://www.ohse.de/uwe/articles/gcc-attributes.html
and I can give you these tags for googling:
- Proper Tail Call (it is mostly relevant to imperative backends of functional languages)
- TABLES_NEXT_TO_CODE (it is mostly relevant to Haskell and LLVM)

But you have 'spaghetti code', when you often use global variables.

Mixing assembler code with c/c++

Why is assembly language code often needed along with C/C++ ?
What can't be done in C/C++, which is possible when assembly language code is mixed?
I have some source code of some 3D computer games. There are a lot of assembler code in use.

Things that pop to mind, in no particular order:
Special instructions. In an embedded application, I need to invalidate the cache after a DMA transfer has filled the memory buffer. The only way to do that on an SH-4 CPU is to execute a special instruction, so inline assembly (or a free-standing assembly function) is the only way to go.
Optimizations. Once upon a time, it was common for compilers to not know every trick that was possible to do. In some of those cases, it was worth the effort to replace an inner loop with a hand-crafted version. On the kinds of CPUs you find in small embedded systems (think 8051, PIC, and so forth) it can be valuable to push inner loops into assembly. I will emphasize that for modern processors with pipelines, multi-issue execution, extensive caching and more, it is often exceptionally difficult for hand coding to even approach the capabilities of the optimizer.
Interrupt handling. In an embedded application it is often needed to catch system events such as interrupts and exceptions. It is often the case that the first few instructions executed by an interrupt have special responsibilities and the only way to guarantee that the right things happen is to write the outer layer of a handler in assembly. For example, on a ColdFire (or any descendant of the 68000) only the very first instruction is guaranteed to execute. To prevent nested interrupts, that instruction must modify the interrupt priority level to mask out the priority of the current interrupt.
Certain portions of an OS kernel. For example, task switching requires that the execution state (at least most registers including PC and stack pointer) be saved for the current task and the state loaded for the new task. Fiddling with execution state of the CPU is well outside of the feature set of the language, but can be wrapped in a small amount of assembly code in a way that allows the rest of the kernel to be written in C or C++.
Edit: I've touched up the wording about optimization. Let me emphasize that for targets with large user populations and well supported compilers with decent optimization, it is highly unlikely that an assembly coder can beat the performance of the optimizer.
Before attempting, start by careful profiling to determine where the bottlenecks really lie. With that information in hand, examine assumptions and algorithms carefully, because the best optimization of all is usually to find a better way to handle the larger picture. Then, if all else fails, isolate the bottleneck in a test case, benchmark it carefully, and begin tweaking in assembly.

Why is assembly language code often
needed along with C/C++ ?
Competitive advantage. Like, if you are writing software for the (soon-to-be) #1 gaming company in the world.
What can't be done in C/C++, which is
possible when assembly language code
is mixed?
Nothing, unless some absolute performance level is needed, say, X frames per second or Y billions of polygons per second.
Edit: based on other replies, it seems the consensus is that embedded systems (iPhone, Android etc) have hardware accelerators that certainly require the use of assembly.
I have some source code of some 3D
computer games. There are a lot of
assembler code in use.
They are either written in the 80's-90's, or they are used sparingly (maybe 1% - 5% of total source code) inside a game engine.
Edit: to this date, compiler auto-vectorization quality is still poor. So, you may see programs that contain vectorization intrinsics, and since it's not really much different from writing in actual assembly (most intrinsics have one-one mapping to assembly instructions) some folks might just decide to write in assembly.
Update:
According to anecdotal evidence, RollerCoaster Tycoon is written in 99% assembly.
http://www.chrissawyergames.com/faq3.htm

In the past, compilers used to be pretty poor at optimizing for a particular architecture, and architectures used to be simpler. Now the reverse is true. These days, it's pretty hard for a human to write better assembly than an optimizing compiler, for deeply-pipelined, branch-predicting processors. And so you won't see it much. What there is will be short, and highly targeted.
In short, you probably won't need to do this. If you think you do, profile your code to make sure you've identified a hotspot - don't optimize something just because it's slow, if you're only spending 0.1% of your execution time there. See if you can improve your design or algorithm. If you don't find any improvement there, or if you need functionality not exposed by your higher-level language, look into hand-coding assembly.

There are certain things that can only be done in assembler and cannot be done in C/C++.
These include:
generating software interrupts (SWI or INT instructions)
Use of instructions like SWP for creating mutexes
specialist coporcessor instructions (such as those needed to program the MMU and manage RAM caches)
Access to carry and overflow flags.
You may also be able to optimize code better in assembler than C/C++ (eg memcpy on Android is written in assembler)

There may be new instructions that your compiler cannot yet generate, or the compiler does a bad job, or you may need to control the CPU directly.

Why is assembly language code often
needed along with C/C++ ?needed along with C/C++ ?
It isn't
What can't be done in C/C++, which is
possible when assembly language code
is mixed?
Accessing system registers or IO ports on the CPU.
Accessing BIOS functions.
Using specialized instructions that doesn't map directly to the programming language,
e.g. SIMD instructions.
Provide optimized code that's better than the compiler produces.
The two first points you usually don't need unless you're writing an operating system, or code
running without an operatiing system.
Modern CPUs are quite complex, and you'll be hard pressed to find people that actually can write assembly than what the compiler produces. Many compilers come with libraries giving you access
to more advanced features, like SIMD instructions, so nowadays you often don't need to fall back to
assembly for that.

One more thing worth mentioning is:
C & C++ do not provide any convenient way to setup stack frames when one needs to implement a binary level interop with a script language - or to implement some kind of support for closures.

Assembly can be very optimal than what any compiler can generate in certain situations.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js