Runtime optimization of static languages: JIT for C++?


Is anyone using JIT tricks to improve the runtime performance of statically compiled languages such as C++? It seems like hotspot analysis and branch prediction based on observations made during runtime could improve the performance of any code, but maybe there's some fundamental strategic reason why making such observations and implementing changes during runtime are only possible in virtual machines. I distinctly recall overhearing C++ compiler writers mutter "you can do that for programs written in C++ too" while listening to dynamic language enthusiasts talk about collecting statistics and rearranging code, but my web searches for evidence to support this memory have come up dry.

Profile-guided optimization is different from runtime optimization. The optimization is still done offline, based on profiling information, but once the binary is shipped there is no ongoing optimization. If the usage patterns exercised during the profiling phase don't accurately reflect real-world usage, the results will be imperfect, and the program won't adapt to different usage patterns either.
You may be interested in looking for information on HP's Dynamo. That system focused on native binary -> native binary translation, but since C++ is almost exclusively compiled to native code I suppose that's exactly what you are looking for.
You may also want to take a look at LLVM, which is a compiler framework and intermediate representation that supports JIT compilation and runtime optimization, although I'm not sure whether there are any LLVM-based runtimes yet that can compile C++, execute it and optimize it at runtime.

I have done that kind of optimization quite a lot over the last few years. It was for a graphics rendering API that I implemented. Since the API defined several thousand different drawing modes, a single general-purpose function was way too slow.
I ended up writing my own little JIT compiler for a domain-specific language (very close to asm, but with some high-level control structures and local variables thrown in).
The performance improvement I got was between a factor of 10 and 60 (depending on the complexity of the compiled code), so the extra work paid off big time.
On the PC I would not start writing my own JIT compiler but would use either LIBJIT or LLVM for the JIT compilation. That wasn't possible in my case because I was working on a non-mainstream embedded processor that is not supported by LIBJIT/LLVM, so I had to invent my own.
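To illustrate why one general-purpose function covering many drawing modes is slow, here is a minimal sketch. It is not a JIT: it swaps runtime code generation for compile-time templates, and only shows the effect of choosing a specialized loop once per call instead of testing the mode for every pixel. The names BlendMode, blend_generic, blend_loop and blend_specialized are hypothetical, as are the three modes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical drawing modes; a real API would have thousands of combinations.
enum class BlendMode { Copy, Add, Xor };

// General-purpose version: the mode is re-tested for every pixel.
void blend_generic(BlendMode mode, std::vector<std::uint8_t>& dst,
                   const std::vector<std::uint8_t>& src) {
    assert(dst.size() == src.size());
    for (std::size_t i = 0; i < dst.size(); ++i) {
        switch (mode) {
            case BlendMode::Copy: dst[i] = src[i]; break;
            case BlendMode::Add:  dst[i] = static_cast<std::uint8_t>(dst[i] + src[i]); break;
            case BlendMode::Xor:  dst[i] = static_cast<std::uint8_t>(dst[i] ^ src[i]); break;
        }
    }
}

// Specialized version: Mode is a compile-time constant, so the compiler can
// fold the mode tests away and emit one tight loop per instantiation.
template <BlendMode Mode>
void blend_loop(std::vector<std::uint8_t>& dst, const std::vector<std::uint8_t>& src) {
    for (std::size_t i = 0; i < dst.size(); ++i) {
        if (Mode == BlendMode::Copy)     dst[i] = src[i];
        else if (Mode == BlendMode::Add) dst[i] = static_cast<std::uint8_t>(dst[i] + src[i]);
        else                             dst[i] = static_cast<std::uint8_t>(dst[i] ^ src[i]);
    }
}

// The mode is dispatched once per call, outside the hot loop.
void blend_specialized(BlendMode mode, std::vector<std::uint8_t>& dst,
                       const std::vector<std::uint8_t>& src) {
    assert(dst.size() == src.size());
    switch (mode) {
        case BlendMode::Copy: blend_loop<BlendMode::Copy>(dst, src); break;
        case BlendMode::Add:  blend_loop<BlendMode::Add>(dst, src);  break;
        case BlendMode::Xor:  blend_loop<BlendMode::Xor>(dst, src);  break;
    }
}
```

With thousands of mode combinations, instantiating every variant ahead of time becomes impractical, which is presumably where generating only the needed variant at runtime pays off.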

The answer is most likely that no one does more than PGO for C++ because the benefits would likely be unnoticeable.
Let me elaborate: JIT engines/runtimes have both blessings and drawbacks from their developers' point of view: they have more information at runtime but much less time to analyze it.
Some optimizations are really expensive, and you are unlikely to see them applied at runtime without a huge impact on start-up time: loop unrolling, auto-vectorization (which in most cases is also based on loop unrolling), instruction selection (to use SSE4.1 on CPUs that support SSE4.1) combined with instruction scheduling and reordering (to make better use of superscalar CPUs). These kinds of optimizations combine well with C-like code (which is accessible from C++).
The only full-blown compiler architecture doing advanced compilation (as far as I know) is Java HotSpot, along with architectures built on similar principles using tiered compilation (Azul Systems' Java, the currently popular JaegerMonkey JS engine).
But one of the biggest runtime optimizations is the following:
Polymorphic inline caching. If you run a loop the first time with some types, then the second time the loop's code is specialized for the types seen in the previous run. The JIT inserts a guard and makes the inlined types the default branch. Starting from this specialized form, an SSA-based engine applies constant folding/propagation, inlining and dead-code elimination, and, depending on how "advanced" the JIT is, does a better or worse job of CPU register assignment.
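The guard-plus-specialized-branch shape described above can be written out by hand in C++. This is only a sketch of what the generated code looks like, not of how a JIT decides to produce it. The Shape, Triangle and Square types are hypothetical, and the speculation "most objects are Triangles" is hard-coded here, whereas a JIT would learn it from earlier runs.

```cpp
#include <cassert>
#include <typeinfo>
#include <vector>

struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};
struct Triangle final : Shape { int sides() const override { return 3; } };
struct Square   final : Shape { int sides() const override { return 4; } };

// Sums sides() over all shapes, speculating that most objects are Triangles.
int total_sides(const std::vector<const Shape*>& shapes) {
    int total = 0;
    for (const Shape* s : shapes) {
        assert(s != nullptr);
        if (typeid(*s) == typeid(Triangle)) {
            // Guard hit: the dynamic type is known, so this qualified call is
            // non-virtual and can be inlined down to the constant 3.
            total += static_cast<const Triangle*>(s)->Triangle::sides();
        } else {
            // Guard miss: fall back to ordinary virtual dispatch.
            total += s->sides();
        }
    }
    return total;
}
```

In hand-written C++ the typeid comparison is not free, so this particular code is unlikely to be faster than a plain virtual call. A JIT typically compares a class pointer it already has in a register.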
As you may notice, the JIT (HotSpot) mostly improves branchy code, and with runtime information it can do better there than C++ code. A static compiler, having time on its side for analysis and instruction reordering, will likely get slightly better performance for simple loops. Also, the parts of C++ code that need to be fast typically tend not to be OOP, so the JIT optimizations would not bring such an amazing improvement.
Another advantage of JITs is that they work across assemblies, so they have more information when they want to inline.
Let me elaborate: let's say that you have a base class A and just one implementation of it, namely B, in another package/assembly/gem/etc. that is loaded dynamically.
When the JIT sees that B is the only implementation of A, it can replace the calls on A with B's code everywhere in its internal representation, and the method calls will not do a dispatch (a vtable lookup) but will be direct calls. Those direct calls may also be inlined. For example, if B has a method getLength() that returns 2, all calls of getLength() may be reduced to the constant 2 everywhere. In the end, C++ code will not be able to skip the virtual call into B from another DLL.
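Within a single binary, a C++ programmer can hand the compiler the same "B is the only implementation" fact by marking the class final (C++11). The sketch below reuses the A, B and getLength() names from the example above. The two helper functions are hypothetical, and whether the call really is devirtualized and folded to 2 depends on the compiler and optimization level.

```cpp
#include <cassert>

struct A {
    virtual ~A() = default;
    virtual int getLength() const = 0;
};

// 'final' promises that no class derives from B, which is the fact a JIT
// would discover by looking at the classes actually loaded.
struct B final : A {
    int getLength() const override { return 2; }
};

// Only the static type A is known: in general this needs a vtable lookup.
int lengthViaBase(const A* a) {
    assert(a != nullptr);
    return a->getLength();
}

// The static type is the final class B: the compiler may call B::getLength
// directly and reduce the result to the constant 2.
int lengthViaFinal(const B* b) {
    assert(b != nullptr);
    return b->getLength();
}
```

This does not help when the caller only ever sees A and B lives in a separately loaded DLL, which is the case the answer is describing.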
Some implementations of C++ do not support optimizing across multiple .cpp files (although recent versions of GCC have the -flto flag, which makes this possible). But if you are a C++ developer concerned about speed, you will likely put all the performance-sensitive classes in the same static library or even in the same file, so the compiler can inline them nicely. The extra information that a JIT has by design is then provided by the developer, so there is no performance loss.

Visual Studio has an option for doing runtime profiling whose results can then be used to optimize the code:
"Profile Guided Optimization"

Microsoft Visual Studio calls this "profile guided optimization"; you can learn more about it at MSDN. Basically, you run the program a bunch of times with a profiler attached to record its hotspots and other performance characteristics, and then you can feed the profiler's output into the compiler to get appropriate optimizations.

I believe LLVM attempts to do some of this. It attempts to optimize across the whole lifetime of the program (compile-time, link-time, and run-time).

Reasonable question - but with a doubtful premise.
As in Nils' answer, sometimes "optimization" means "low-level optimization", which is a nice subject in its own right.
However, it is based on the concept of a "hot-spot", which has nowhere near the relevance it is commonly given.
Definition: a hot-spot is a small region of code where a process's program counter spends a large percentage of its time.
If there is a hot-spot, such as a tight inner loop occupying a lot of time, it is worth trying to optimize at the low level, if it is in code that you control (i.e. not in a third-party library).
Now suppose that inner loop contains a call to a function, any function. Now the program counter is not likely to be found there, because it is more likely to be in the function. So while the code may be wasteful, it is no longer a hot-spot.
There are many common ways to make software slow, of which hot-spots are one. However, in my experience, that is the only one of which most programmers are aware, and the only one to which low-level optimization applies.
See this.

Related

Why do memcpy() and other similar functions use assembly?

I took a look at the parts of the code behind memcpy and other functions (memset, memmove, ...) and it seems to be a lot, and a lot of assembly code.
Other stackoverflow questions on this topic mention that a reason for that may be because it contains different code for different CPU architectures.
I have personally written my own memcpy/memset functions with very few lines of C++ code and in 1 million iterations with time measured with chrono, I consistently get better times.
So the question is, why did the programmers not just write the code in C/C++ and let the compiler interpret and optimize it how it thinks is best? Why so much assembly code?

This "It's pointless to rewrite in assembly" is a myth. A more accurate way to express it is that few programmers have the skill required to beat the compiler. But they do exist, and especially among those who develop compilers.

It's technically impossible to write memcpy in standard C++ and C as you have to rely on undefined constructs. The same is true for other standard library functions; memset and malloc are two other examples.
But that's not the only reason: a C and C++ standard library implementation is, these days, so closely coupled with a particular compiler that the library writers can take all sorts of liberties that you, as a consumer, cannot. isupper, toupper, &c. stand out as good examples where a particular character encoding can be assumed.
Another good reason is that expertly handcrafted assembly can be difficult to beat for performance.

Compilers usually generate some unnecessary code (compared to hand-written assembly) even at the highest optimization level. This wastes memory, which is bad especially on embedded systems, and reduces performance.
Are you sure your custom code is complete and flawless? I don't think so, because when you write assembly you have full control over everything, but when you compile code there is a possibility that the compiler generates something you don't want (and that's your fault, not the compiler's).
It's almost impossible for a compiler to generate code that is as complete as hand-written assembly and smaller than it at the same time.
As mentioned in some comments, it also depends on platform.

memcpy and memset, as well as other functions, are written in assembly to take advantage of processor-specific instructions.
For example, the ARM processor has an instruction that can load multiple registers from successive locations. There is also a store-multiple instruction that stores multiple registers into successive locations. The Intel x86 has block read and write instructions.
The assembly language allows for copying 4 8-bit bytes using a single 32-bit register.
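For comparison, here is a minimal sketch of how copying four bytes through one 32-bit value can be expressed in portable C++. The function name copy_words is hypothetical. It relies on the assumption, true for common optimizing compilers but not guaranteed, that a std::memcpy of a constant 4 bytes is compiled to a single load and a single store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies n bytes from src to dst, four bytes at a time where possible.
// The regions must not overlap (same contract as memcpy).
void copy_words(void* dst, const void* src, std::size_t n) {
    assert(dst != nullptr && src != nullptr);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);

    // Main loop: move 4 bytes per iteration through one 32-bit value.
    // Going through std::memcpy avoids the alignment and aliasing problems
    // of casting the byte pointers to std::uint32_t*.
    while (n >= sizeof(std::uint32_t)) {
        std::uint32_t word;
        std::memcpy(&word, s, sizeof word);
        std::memcpy(d, &word, sizeof word);
        s += sizeof word;
        d += sizeof word;
        n -= sizeof word;
    }
    // Tail: copy the remaining 0 to 3 bytes one at a time.
    while (n > 0) {
        *d++ = *s++;
        --n;
    }
}
```

Whether this beats a plain byte loop depends on the compiler, which may already widen the byte loop on its own.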
Some processors allow for conditional execution of instructions, which helps when unrolling loops.
I've written optimized memcpy and memset functions for various processors. I've also spent a lot of time arguing (discussing) C and C++ "best" implementations with compilers. It's a little difficult using C or C++ to try and get the compiler to use the processor instructions you want it to.

Why did the programmers not just write the code in C/C++
We aren't mind readers. We don't even know what they wrote. If you need an authoritative answer, then you should ask the programmers that wrote the code.
But we can hypothesise, that they wrote what they did because it was fast, and did the right thing.

Are the Optimization Keywords in C and C++ Reasonable?

So we've all heard the don't-use-register line, the reasoning being that trying to out-optimize a compiler is a fool's errand.
register, from what I know, doesn't actually state anything about CPU registers, just that a given variable can't be referenced indirectly. I'll hazard a guess that it's often referred to as obsolete because compilers can detect a lack of addressing automatically thus making such optimizations transparent.
But if we're firm on that argument, can't it be levelled at every optimization-driven keyword in C? Why do we use inline and C99's restrict for example?
I suppose that some things like aliasing make deducing some optimizations hard or even impossible, so where is the line drawn before we start venturing into Sufficiently Smart Compiler territory?
Where should the line be drawn in C and C++ between spoon-feeding a compiler optimization information and assuming it knows what it's doing?
EDIT: Jens Gustedt pointed out that my conflating of C and C++ isn't right since two of the keywords have semantic differences and one doesn't exist in standard C++. I had a good link about register in C++ which I'll add if I find it...

I would agree that register and inline are somewhat similar in this respect. If the compiler can see the body of the callee while compiling a call site, it should be able to make a good decision on inlining. The use of the inline keyword in both C and C++ has more to do with the mechanics of making the body of the function visible than with anything else.
restrict, however, is different. When compiling a function, the compiler has no idea of what the call sites are going to be. Being able to assume no aliasing can enable optimizations that would otherwise be impossible.
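A minimal sketch of the no-aliasing promise follows. Standard C++ has no restrict keyword, so it uses __restrict, an extension accepted by GCC, Clang and MSVC. The function name add_arrays is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// __restrict promises that, for the duration of the call, the memory written
// through 'out' is not accessed through 'a' or 'b'. Without it the compiler
// has to assume that a store to out[i] might change a later a[j] or b[j].
void add_arrays(float* __restrict out, const float* __restrict a,
                const float* __restrict b, std::size_t n) {
    assert(out != nullptr && a != nullptr && b != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

With the promise in place the loop can be vectorized without a runtime overlap check. Calling it with overlapping arrays breaks the promise and is undefined behaviour.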

inline is used in the scenario where you implement a non-templated function within the header then include it from multiple compilation units.
This ensures that the compiler should create just one instance of the function as though it were inlined, so you do not get a link error for multiply defined symbol. It does not however require the compiler to actually inline it.
GCC has an always_inline function attribute (and MSVC has __forceinline) to force inlining, but those are language extensions.
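The header scenario described above looks roughly like this. The file name shapes.h and the function area are hypothetical.

```cpp
// shapes.h (hypothetical header, included from several .cpp files)
#ifndef SHAPES_H
#define SHAPES_H

#include <cassert>

// Without 'inline', every translation unit that includes this header would
// emit its own external definition of area(), and the link would fail with a
// multiple-definition error. With 'inline', the definitions are merged.
// Whether calls are actually expanded in place is still up to the compiler.
inline int area(int width, int height) {
    assert(width >= 0 && height >= 0);
    return width * height;
}

#endif  // SHAPES_H
```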

register doesn't even say that you can't reference the
variable indirectly (at least in C++). It said that in the
original C, but that has been dropped.
Whether trying to out-optimize the compiler is a fool's errand
depends on the optimization. Not many compilers, for example,
will convert sin(x) * sin(x) + cos(x) * cos(x) into 1.
Today, most compilers ignore register, and no one uses it,
because compilers have become good enough at register allocation
to do a better job than you can with register. In fact,
respecting register would typically make the generated code
slower. This is not the case for inline or restrict: in
both cases, there exist techniques, at least theoretically,
which could result in the compiler doing a better job than you
can. Such techniques are not widespread, however, and (as far
as I know, at least), have a very high compile time overhead,
with in some cases compile times which grow exponentially with
the size of the program (which makes them more or less unusable
on most real programs—compile times which are measured in
years really aren't acceptable).
As to where to draw the line... it changes in time. When
I first started programming in C, register made a significant
difference, and was widely used. Today, no. I imagine that in
time, the same may happen with inline or restrict—some
experimental compilers are very close with inline already.

This is a flame-bait question but I will dive in anyway.
Compilers are a lot better at optimising than your average programmer. There was a time I programmed on a 25MHz 68030 and I got some advantage from the use of register because the compiler's optimizer was so poor. But that was back in 1990.
I see inline as just as bad as register.
In general, measure first before you modify. If you find that your code performs so poorly you want to use register or inline, take a deep breath, stand back and look for a better algorithm first.
In recent times (i.e. the last 5 years) I have gone through code bases and removed inline functions galore with no perceptible change in performance being visible. Code size, however, always benefits from the removal of inline methods. That isn't a big issue for your standard x86-style monster multicore marvel of the modern age but it does matter if you work in the embedded space.

It is a moving target, because compiler technology is improving. (Well, sometimes it is more changing than improving, but that has some of the same effect of rendering your optimization attempts moot, or worse.)
Generally, you should not guess at whether an optimization keyword or other optimization technique is good or not. One has to learn quite a bit about how computers work, including the particular platform you are targeting, and how compilers work.
So a rule about using various optimization techniques is to ask do I know the compiler will not do the best job here? Am I willing to commit to that for a while—will the compiler remain stable while this code is in use, am I willing to rewrite the code when the compiler changes this situation? Typically, you have to be an experienced and knowledgeable software engineer to know when you can do better than the compiler. It also helps if you can talk to the compiler developers.
This means people cannot give you an answer here that has a definite guideline. It depends on what compiler you are using, what your project is, what your resources are, and what your goals are, and so on.
Although some people say not to try to out-optimize the compiler, there are various areas of software engineering where people do better than a compiler and in which it is worth the expense of paying people for this.

The difference is as follows:
register is a very local optimization (i.e. inside one function). Register allocation is a relatively solved problem, thanks both to smarter compilers and to larger numbers of registers (mostly the former, but x86-64, say, has more registers than x86, and both have more than, say, an 8-bit processor).
inline is harder as it is an inter-procedural optimization. However, as it involves a relatively small depth of recursion and a small number of procedures (if the inlined procedure is too big there is no sense in inlining it), it may be safely left to the compiler.
restrict is much harder. To know for certain that two pointers don't alias you would need to analyse the whole program (including libraries, system, plug-ins etc.), and even then you would run into problems. However, the information is clearer to the programmer AND it is part of the specification.
Consider very simple code:
#include <stddef.h>  /* size_t */

/* Byte-by-byte copy; dst and src are assumed not to overlap. */
void my_memcpy(void *dst, const void *src, size_t size) {
    for (size_t i = 0; i < size; i++) {
        ((char *)dst)[i] = ((const char *)src)[i];
    }
}
Is there a benefit to making this code efficient? Yes: memcpy tends to be very useful (say, for a copying GC). Can this code be vectorized (here: moved by words, say 128 bits at a time instead of 8)? The compiler would have to deduce that dst and src do not alias in any way and that the regions they point to are independent. size may depend on user input, runtime behaviour or other elements, which makes the analysis practically impossible (the problem is similar to the Halting Problem: in general we cannot analyse everything without running it). Or the function might be part of the C library (I assume shared libraries) and be called by the program, hence not all call sites are even known at compile time. Without such analysis, vectorizing could make the program behave differently with optimization on. On the other hand, the programmer might ensure that they are different objects simply by knowing the (even higher-level) design, without needing bottom-up analysis.
restrict can also be part of the documentation, as it might be the programmer who wrote the procedure in a way that cannot handle two aliasing pointers. For example, if we want to copy memory between overlapping locations, the above code is incorrect.
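To make the overlap problem concrete, here is a small sketch. copy_forward behaves like my_memcpy above, and copy_overlapping is a hypothetical variant with the contract of memmove. Both names are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Forward byte copy, like my_memcpy: wrong when dst starts inside the source
// region, because source bytes are overwritten before they have been read.
void copy_forward(char* dst, const char* src, std::size_t size) {
    for (std::size_t i = 0; i < size; i++) dst[i] = src[i];
}

// Overlap-tolerant variant: copies backwards when dst starts inside
// [src, src + size), forwards otherwise.
void copy_overlapping(char* dst, const char* src, std::size_t size) {
    assert(dst != nullptr && src != nullptr);
    // std::less gives a well-defined order even for unrelated pointers.
    std::less<const char*> before;
    const bool dst_inside_src = before(src, dst) && before(dst, src + size);
    if (!dst_inside_src) {
        copy_forward(dst, src, size);
    } else {
        for (std::size_t i = size; i > 0; i--) dst[i - 1] = src[i - 1];
    }
}
```

Shifting "abcdef" right by one position with copy_forward smears the first byte over the whole buffer, while copy_overlapping produces the intended "aabcde".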
So to sum up: a Sufficiently Smart Compiler would not be able to deduce restrict (unless we move to compilers understanding the meaning of code) without knowing the whole program. Even then it would be close to undecidable. For local optimization, however, compilers are already sufficiently smart. My guess is that a Sufficiently Smart Compiler with whole-program analysis would nonetheless be able to deduce it in many interesting cases.
PS. By local I mean single function. So local optimization cannot assume anything about arguments, global variables etc.

One thing that hasn't been mentioned is that many non-x86 compilers aren't nearly as good at optimizing as gcc and other "modern" C-compilers are.
For instance, the compilers for PIC are absolutely terrible at optimizing. Also, the optimizer for cicc (the CUDA compiler), though much better, still seems to miss a lot of fairly simple optimizations.
For these cases, I've found optimization hints like register, inline, and #pragma unroll to be extremely useful.

From what I saw back in the days when I was more involved with C/C++, these are merely orders given directly to the compiler. The compiler may try to inline a function even if it is not given the direct order to do so. That really depends on the compiler and may even raise some cross-compiler issues. As an example, Visual Studio provides different levels of optimization which correspond to different intelligence levels of the compiler. I have read that member functions defined inside a class definition are implicitly inline, which gives the compiler a hint to minimize function call overhead. In any case, these directives are extremely helpful when you are using a less intelligent compiler, while with an intelligent compiler the optimization may already be obvious to it.
Also, be sure that these keywords are guaranteed to be safe. Some compiler optimizations may not work with some libraries such as OpenGL (as I have seen it myself). So in cases where you feel that compiler optimization may be harmful, you can use these keywords to make sure it is done the way you want it to.
The compilers such as g++ these days optimize the code very well. You might as well search for optimization elsewhere, maybe in the methods and algorithm you use or by using TBB or CUDA to make your code parallel.

Why are generated binaries so large?

Why are the binaries that are generated when I compile my C++ programs so large (as in easily 10 times the size of the source code files)? What advantages does this offer over interpreted languages for which such compilation is not necessary (and thus the program size is only the size of the code files)?

Modern interpreted languages do typically compile the code to some manner of representation for faster execution... it might not get written out to disk, but there's certainly no guarantee that the program is represented in a more compact form. Some interpreters go the whole hog and generate machine code anyway (e.g. Java JIT). Then there's the interpreter itself sitting in memory which can be large.
A few points:
The more sophisticated the commands in the source code, the more machine code operations might be required to execute them. Thus, higher level language features tend to have a higher ratio of compiled-code to source code. That's not necessarily a bad thing: think of it as "I only have to say a little about what I want done and it infers all those necessary steps". The challenge in programming is to ensure they are necessary - that requires good library and program design.
The compiler often deliberately decides to trade some executable size for faster expected execution speed: inline vs out-of-line code is part of this compromise, though for small functions neither may be consistently more compact.
More sophisticated run-time environments (e.g. adding support for C++ exceptions) can involve a bit of extra code that runs when the program first starts to construct the necessary environment for that language feature.
Library features may not be comparable. As well as the sort of add-on libraries you're very likely to have had to track down yourself and be very aware of using (e.g. XML, PDF parsing, OpenGL), languages often quietly use supporting libraries for what seem like language features and functions. Any of these can be surprisingly large.
For example, many interpreters just expose the C library's printf() statement or something similar, while for output formatting C++ has ostream - a more complex, extensible and type-safe system with (for better or worse) persistent state across function calls, routines to query and set that state, an additional layer of customisable buffering, customisable character types and localisation, and generally a lot of small inline functions that can lead to smaller or larger programs depending on the exact use and compiler settings. What's best depends on your application and memory vs performance goals.
Inbuilt language statements may be compiled differently. Take a switch on an integer expression with 100 case labels spread randomly between 1 and 1000: one compiler/language might decide to "pack" the 100 cases and do a binary search for a match, another might use a sparsely populated array of 1000 elements and do direct indexing (which wastes space in the executable but typically makes for faster code). So, it's hard to draw conclusions based on executable size.
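The two strategies for the sparse switch can be written out as ordinary lookups, which makes the size/speed trade visible. This is a sketch of the idea only, not of any particular compiler's output. The five labels and all the names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical sparse case labels, sorted (a real switch might have 100).
constexpr std::array<int, 5> kLabels = {3, 17, 256, 640, 999};

// Strategy 1: compact table plus binary search. Small, O(log n) per lookup.
// Returns the index of the matching case, or -1 for "default".
int find_case_packed(int value) {
    assert(std::is_sorted(kLabels.begin(), kLabels.end()));
    auto it = std::lower_bound(kLabels.begin(), kLabels.end(), value);
    if (it != kLabels.end() && *it == value) {
        return static_cast<int>(it - kLabels.begin());
    }
    return -1;
}

// Strategy 2: a table with one slot per possible value (0..1000).
// Larger and mostly filled with "default" entries, but O(1) per lookup.
std::array<int, 1001> build_direct_table() {
    std::array<int, 1001> table;
    table.fill(-1);
    for (std::size_t i = 0; i < kLabels.size(); ++i) {
        table[static_cast<std::size_t>(kLabels[i])] = static_cast<int>(i);
    }
    return table;
}

int find_case_direct(const std::array<int, 1001>& table, int value) {
    if (value < 0 || value > 1000) return -1;  // out of range: "default"
    return table[static_cast<std::size_t>(value)];
}
```

Here the packed form needs 5 integers and the direct form 1001, for the same behaviour.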
Typically, memory usage and execution speed become increasingly important as the program gets larger and more complex. You don't see systems like Operating Systems, enterprise web servers or full-featured commercial word processors written in interpreted languages because they don't have the scalability.

Interpreted languages assume an interpreter is available while compiled programs are in most cases standalone.

Take a trivial case: Suppose you have a one line program
print("hello world")
What does that "print" do? Surely it's clear that you're asking some other code to do some work? And that code isn't free: the sum total of what needs to run is much more than the lines of code you write. In more realistic programs you exploit many sophisticated libraries managing windows and other UI features, networks, databases and so on. Now whether that code is bundled into your application, loaded from DLLs or present in the interpreter, it's got to be somewhere.
There are plenty of trade-offs between compilation and interpretation, and intermediate solutions such as Java's compilation/byte-code interpretation approach. For example, you might consider
the run-time cost of interpreting the source every time you run versus running the compiled code
the portability advantages of interpreters - with a compiler you need to build separate versions of an app for different platforms.

Usually, programs are written in higher-level languages. For these programs to be executed by the CPU, they have to be converted to machine code. This conversion is done by a compiler or an interpreter.
A compiler makes the conversion just once, while an interpreter typically converts the program every time it is executed.
Interpreted programs run much slower than compiled programs because the interpreter must analyze each statement in the program each time it is executed and then perform the desired action, whereas the compiled code just performs the action within a fixed context determined by the compilation (which is the reason for the large binary files).
Another disadvantage of interpreters is that they must be present in the environment as additional software to run the source code.

Using Assembly Language in C/C++

I remember reading somewhere that to really optimize & speed up certain section of the code, programmers write that section in Assembly language. My questions are -
Is this practice still done? and How does one do this?
Isn't writing in Assembly Language a bit too cumbersome & archaic?
When we compile C code (with or without -O3 flag), the compiler does some code optimization & links all libraries & converts the code to binary object file. So when we run the program it is already in its most basic form i.e. binary. So how does inducing 'Assembly Language' help?
I am trying to understand this concept & any help or links is much appreciated.
UPDATE: Rephrasing point 3 as requested by dbemerlin - Because you might be able to write more effective assembly code than the compiler generates, but unless you are an assembler expert your code will probably run slower, because the compiler often optimizes the code better than most humans can.

The only time it's useful to revert to assembly language is when
the CPU instructions don't have functional equivalents in C++ (e.g. single-instruction-multiple-data instructions, BCD or decimal arithmetic operations)
AND the compiler doesn't provide extra functions to wrap these operations (e.g. C++11 Standard has atomic operations including compare-and-swap, <cstdlib> has div/ldiv et al for getting quotient and remainder efficiently)
AND there isn't a good third-party library (e.g. http://mitpress.mit.edu/catalog/item/default.asp?tid=3952&ttype=2)
OR
for some inexplicable reason - the optimiser is failing to use the best CPU instructions
...AND...
the use of those CPU instructions would give some significant and useful performance boost to bottleneck code.
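Before dropping to assembly for the operations named in the list above, the library wrappers are worth a look. The sketch below uses std::div from <cstdlib> and a compare-and-swap loop from the C++11 <atomic> header. The helper names split and store_max are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>

// Quotient and remainder from one library call.
std::div_t split(int numerator, int denominator) {
    assert(denominator != 0);
    return std::div(numerator, denominator);
}

// Lock-free "store the maximum" built on compare-and-swap.
void store_max(std::atomic<int>& target, int candidate) {
    int seen = target.load();
    // On failure compare_exchange_weak reloads 'seen' with the current value,
    // so the loop retries until the store succeeds or the candidate is no
    // longer larger than what is stored.
    while (candidate > seen && !target.compare_exchange_weak(seen, candidate)) {
    }
}
```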
Simply using inline assembly to do an operation that can easily be expressed in C++ - like adding two values or searching in a string - is actively counterproductive, because:
the compiler knows how to do this equally well
to verify this, look at its assembly output (e.g. gcc -S) or disassemble the machine code
you're artificially restricting its choices regarding register allocation, CPU instructions etc., so it may take longer to prepare the CPU registers with the values needed to execute your hardcoded instruction, then longer to get back to an optimal allocation for future instructions
compiler optimisers can choose between equivalent-performance instructions specifying different registers to minimise copying between them, and may choose registers in such a way that a single core can process multiple instructions during one cycle, whereas forcing everything through specific registers would serialise it
in fairness, GCC has ways to express needs for specific types of registers without constraining the CPU to an exact register, still allowing such optimisations, but it's the only inline assembly I've ever seen that addresses this
if a new CPU model comes out next year with another instruction that's 1000% faster for that same logical operation, then the compiler vendor is more likely to update their compiler to use that instruction, and hence your program to benefit once recompiled, than you are (or whoever's maintaining the software then is)
the compiler will select an optimal approach for the target architecture it's told about: if you hardcode one solution then it will need to be a lowest-common-denominator or #ifdef-ed for your platforms
assembly language isn't as portable as C++, both across CPUs and across compilers, and even if you seemingly port an instruction, it's possible to make a mistake re registers that are safe to clobber, argument passing conventions etc.
other programmers may not know or be comfortable with assembly
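The GCC feature mentioned in the list above, asking for a class of register rather than naming one, looks roughly like this. It is a sketch for x86 with GCC or Clang only, in the default AT&T assembler syntax, and it deliberately does something the compiler can do just as well on its own. The function name add_with_asm is hypothetical, and other compilers and targets take the plain C++ fallback.

```cpp
// Adds b to a. The "r" constraints mean "any general-purpose register", so
// the register allocator keeps its freedom to pick which ones.
int add_with_asm(int a, int b) {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __asm__("addl %1, %0"   // AT&T syntax: operand 0 += operand 1
            : "+r"(a)       // operand 0: read-write, any register
            : "r"(b)        // operand 1: input, any register
            : "cc");        // the add instruction modifies the flags
    return a;
#else
    return a + b;           // portable fallback for other compilers/targets
#endif
}
```

Even with flexible constraints, the compiler cannot see inside the asm string, so it cannot fold add_with_asm(2, 3) into the constant 5 the way it can with a + b.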
One perspective that I think's worth keeping in mind is that when C was introduced it had to win over a lot of hardcore assembly language programmers who fussed over the machine code generated. Machines had less CPU power and RAM back then and you can bet people fussed over the tiniest thing. Optimisers became very sophisticated and have continued to improve, whereas the assembly languages of processors like the x86 have become increasingly complicated, as have their execution pipelines, caches and other factors involved in their performance. You can't just add values from a table of cycles-per-instruction any more. Compiler writers spend time considering all those subtle factors (especially those working for CPU manufacturers, but that ups the pressure on other compilers too). It's now impractical for assembly programmers to average - over any non-trivial application - significantly better efficiency of code than that generated by a good optimising compiler, and they're overwhelmingly likely to do worse. So, use of assembly should be limited to times it really makes a measurable and useful difference, worth the coupling and maintenance costs.

First of all, you need to profile your program. Then you optimize the most used paths in C or C++ code. Unless advantages are clear you don't rewrite in assembler. Using assembler makes your code harder to maintain and much less portable - it is not worth it except in very rare situations.

(1) Yes, the easiest way to try this out is to use inline assembly, this is compiler dependent but usually looks something like this:
__asm
{
mov eax, ebx
}
(2) This is highly subjective
(3) Because you might be able to write more effective assembly code than the compiler generates.

You should read the classic book Zen of Code Optimization and the followup Zen of Graphics Programming by Michael Abrash.
In short, in the first book he explained how to push assembly programming to its limits. In the followup he explained that programmers should rather use some higher level language like C and only try to optimize very specific spots using assembly, if necessary at all.
One motivation for this change of mind was that he saw that programs highly optimized for one generation of processor could become (somewhat) slow on the next generation of the same processor family compared to code compiled from a high level language (the compiler might use new instructions, for instance, or the performance and behavior of existing ones might change from one processor generation to another).
Another reason is that compilers are quite good and optimize aggressively nowadays; there is usually much more performance to gain by working on algorithms than by converting C code to assembly. Even GPU (graphics card processor) programming can be done in C using CUDA or OpenCL.
There are still some (rare) cases when you should/have to use assembly, usually to get very fine control on the hardware. But even in OS kernel code it's usually very small parts and not that much code.

There are very few reasons to use assembly language these days; even low-level constructs like SSE and the older MMX have built-in intrinsics in both gcc and MSVC (icc too I bet but I never used it).
Honestly, optimizers these days are so insanely aggressive that most people couldn't match even half their performance writing code in assembly. You can change how data is ordered in memory (for locality) or tell the compiler more about your code (through #pragma), but actually writing assembly code... doubt you'll get anything extra from it.
@VJo, note that using intrinsics in high level C code would let you do the same optimizations, without using a single assembly instruction.
And for what it's worth, there have been discussions about the next Microsoft C++ compiler, and how they'll drop inline assembly from it. That speaks volumes about the need for it.
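To make the point about intrinsics concrete, here is a minimal sketch, assuming an x86 target where SSE is available. `add_arrays` and `HAVE_SSE` are hypothetical names made up for this illustration; on other targets the function falls back to a plain loop.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if HAVE_SSE
    // Four floats per iteration, one SSE add each, no assembly written by hand.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail (and the whole loop on targets without SSE).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Calling `add_arrays(a, b, out, 7)` handles the first four elements with the vector path and the remaining three with the scalar loop. Whether this beats what the optimizer produces from the plain loop alone is something to measure, not assume.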

I don't think you specified the processor. The answer differs depending on the processor and the environment. The general answer is yes, it is still done; it is certainly not archaic. The general reason is the compilers: sometimes they do a good job at optimizing in general but not really well for specific targets. Some are really good at one target and not so good at others. Most of the time it is good enough, most of the time you want portable C code and not non-portable assembler. But you will find that C libraries still hand optimize memcpy and other routines where the compiler simply cannot figure out that there is a very fast way to implement them. In part that is because the corner case is not worth spending time on making the compiler optimize for; you just solve it in assembler, and the build system has a lot of "if this target then use C, if that target use C, if that target use asm" logic. So it still occurs, and I argue it must continue forever in some areas.
x86 is its own beast with a lot of history, we are at a point where you really cannot in a practical manner write one blob of assembler that is always faster, you can definitely optimize routines for a specific processor on a specific machine on a specific day, and outperform the compiler. Other than for some specific cases it is generally futile. Educational but overall not worth the time. Also note the processor is no longer the bottleneck, so a sloppy generic C compiler is good enough, find the performance elsewhere.
Other platforms, which often means embedded: arm, mips, avr, msp430, pic, etc. You may or may not be running an operating system, you may or may not be running with a cache or other such things that your desktop has. So the weaknesses of the compiler will show. Also note that programming languages continue to evolve away from processors instead of toward them. Even in the case of C, considered perhaps to be a low level language, it doesn't match the instruction set. There will always be times where you can produce segments of assembler that outperform the compiler. Not necessarily the segment that is your bottleneck but across the entire program you can often make improvements here and there. You still have to check the value of doing that. In an embedded environment it can and does make the difference between success and failure of a product. Suppose your product has $25 per unit invested in more power hungry, board real estate, higher speed processors so you don't have to use assembler, but your competitor spends $10 or less per unit and is willing to mix asm with C to use smaller memories, use less power, cheaper parts, etc. Well, so long as the NRE is recovered, the mixed-with-asm solution will win in the long run.
True embedded is a specialized market with specialized engineers. Another embedded market, your embedded linux roku, tivo, etc. Embedded phones, etc all need to have portable operating systems to survive because you need third party developers. So the platform has to be more like a desktop than an embedded system. Buried in the C library as mentioned or the operating system there may be some assembler optimizations, but as with the desktop you want to try to throw more hardware at so the software can be portable instead of hand optimized. And your product line or embedded operating system will fail if assembler is required for third party success.
The biggest concern I have is that this knowledge is being lost at an alarming rate. Because nobody inspects the assembler, because nobody writes in assembler, etc. Nobody is noticing that the compilers have not been improving when it comes to the code being produced. Developers often think they have to buy more hardware instead of realizing that by either knowing the compiler or how to program better they can improve their performance by 5 to several hundred percent with the same compiler, sometimes with the same source code. 5-10% usually with the same source code and compiler. gcc 4 does not always produce better code than gcc 3, I keep both around because sometimes gcc3 does better. Target specific compilers can (not always do) run circles around gcc, you can see a few hundred percent improvement sometimes with the same source code different compiler. Where does all of this come from? The folks that still bother to look and/or use assembler. Some of those folks work on the compiler backends. The front end and middle are fun and educational certainly, but the backend is where you make or break quality and performance of the resulting program. Even if you never write assembler but only look at the output from the compiler from time to time (gcc -O2 -S myprog.c, with a capital -S to emit assembly; lowercase -s only strips symbols) it will make you a better high level programmer and will retain some of this knowledge. If nobody is willing to know and write assembler then by definition we have given up in writing and maintaining compilers for high level languages and software in general will cease to exist.
Understand that with gcc for example the output of the compiler is assembly that is passed to an assembler which turns it into object code. The C compiler does not normally produce binaries. The objects when combined into the final binary, are done by the linker, yet another program that is called by the compiler and not part of the compiler. The compiler turns C or C++ or ADA or whatever into assembler then the assembler and linker tools take it the rest of the way. Dynamic recompilers, like tcc for example, must be able to generate binaries on the fly somehow, but I see that as the exception not the rule. LLVM has its own runtime solution as well as quite visibly showing the high level to internal code to target code to binary path if you use it as a cross compiler.
So back to the point, yes it is done, more often than you think. Mostly it has to do with the language not comparing directly to the instruction set, and then the compiler not always producing fast enough code. If you can get, say, dozens of times improvement on heavily used functions like malloc or memcpy, or want to have an HD video player on your phone without hardware support, balance the pros and cons of assembler. Truly embedded markets still use assembler quite a bit, sometimes it is all C but sometimes the software is completely coded in assembler. For desktop x86, the processor is not the bottleneck. The processors are microcoded. Even if you make beautiful looking assembler on the surface it won't run really fast on all families of x86 processors; sloppy, good enough code is more likely to run about the same across the board.
I highly recommend learning assembler for non-x86 ISAs like arm, thumb/thumb2, mips, msp430, avr. Targets that have compilers, particularly ones with gcc or llvm compiler support. Learn the assembler, learn to understand the output of the C compiler, and prove that you can do better by actually modifying that output and testing it. This knowledge will help make your desktop high level code much better without assembler, faster and more reliable.

It depends. It is (still) being done in some situations, but for the most part, it is not worth it. Modern CPUs are insanely complex, and it is equally complex to write efficient assembly code for them. So most of the time, the assembly you write by hand will end up slower than what the compiler can generate for you.
Assuming a decent compiler released within the last couple of years, you can usually tweak your C/C++ code to gain the same performance benefit as you would using assembly.
A lot of people in the comments and answers here are talking about the "N times speedup" they gained rewriting something in assembly, but that by itself doesn't mean too much. I got a 13 times speedup from rewriting a C function evaluating fluid dynamics equations in C, by applying many of the same optimizations as you would if you were to write it in assembly, by knowing the hardware, and by profiling. At the end, it got close enough to the theoretical peak performance of the CPU that there would be no point in rewriting it in assembly. Usually, it's not the language that's the limiting factor, but the actual code you've written. As long as you're not using "special" instructions that the compiler has difficulty with, it's hard to beat well-written C++ code.
Assembly isn't magically faster. It just takes the compiler out of the loop. That is often a bad thing, unless you really know what you're doing, since the compiler performs a lot of optimizations that are really really painful to do manually. But in rare cases, the compiler just doesn't understand your code, and can't generate efficient assembly for it, and then, it might be useful to write some assembly yourself. Other than driver development or the like (where you need to manipulate the hardware directly), the only place I can think of where writing assembly may be worth it is if you're stuck with a compiler that can't generate efficient SSE code from intrinsics (such as MSVC). Even there, I'd still start out using intrinsics in C++, and profile it and try to tweak it as much as possible, but because the compiler just isn't very good at this, it might eventually be worth it to rewrite that code in assembly.
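One example of the kind of hardware-aware tuning that stays entirely in C++ is memory access order. The sketch below is only an illustration (the `Matrix` type and both function names are made up here, not taken from the fluid dynamics code mentioned above): both functions compute the same sum, but the first walks memory sequentially while the second jumps by a full row on every step.

```cpp
#include <cstddef>
#include <vector>

// Row-major matrix in one contiguous buffer: element (r, c) lives at r * cols + c.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<double> data;
};

// Inner loop walks consecutive addresses, which is friendly to caches and prefetchers.
double sum_row_order(const Matrix& m) {
    double total = 0.0;
    for (std::size_t r = 0; r < m.rows; ++r) {
        for (std::size_t c = 0; c < m.cols; ++c) {
            total += m.data[r * m.cols + c];
        }
    }
    return total;
}

// Same arithmetic, but the inner loop strides by `cols` elements each step.
double sum_column_order(const Matrix& m) {
    double total = 0.0;
    for (std::size_t c = 0; c < m.cols; ++c) {
        for (std::size_t r = 0; r < m.rows; ++r) {
            total += m.data[r * m.cols + c];
        }
    }
    return total;
}
```

On typical cached hardware the row-order version tends to be faster for large matrices, but the size of the gap depends on the machine, so profile both before drawing conclusions.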

Take a look here, where the guy improved performance 6 times using assembly code. So, the answer is: it is still being done, but the compiler is doing a pretty good job.

"Is this practice still done?"
--> It is done in image processing, signal processing, AI (e.g. efficient matrix multiplication), and other areas. I would bet the processing of the scroll gesture on my macbook trackpad is also partially assembly code because it is immediate.
--> It is even done in C# applications (see https://blogs.msdn.microsoft.com/winsdk/2015/02/09/c-and-fastcall-how-to-make-them-work-together-without-ccli-shellcode/)
"Isn't writing in Assembly Language a bit too cumbersome & archaic?"
--> It is a tool like a hammer or a screwdriver and some tasks require a watchmaker screwdriver.
"When we compile C code (with or without -O3 flag), the compiler does some code optimization ... So how does inducing 'Assembly Language' help?"
--> I like what @jalf said, that writing C code in the way you would write assembly will already lead to efficient code. However, to do this you must think about how you would write the code in assembly language, e.g. understand all the places where data is copied (and feel pain each time it is unnecessary).
With assembly language you can be sure which instructions are generated. Even if your C code is efficient there is no guarantee that the resulting assembly will be efficient with every compiler. (see https://lucasmeijer.com/posts/cpp_unity/)
--> With assembly language, when you distribute a binary, you can test for the CPU and take different branches depending on the CPU features, e.g. one optimized for AVX and one just for SSE, but you only need to distribute one binary. With intrinsics this is also possible in C++ or .NET Core 3. (see https://devblogs.microsoft.com/dotnet/using-net-hardware-intrinsics-api-to-accelerate-machine-learning-scenarios/)
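A minimal sketch of that "one binary, several code paths" idea in C++, assuming GCC or Clang on x86, where `__builtin_cpu_supports` is available. The function names are hypothetical, and the "AVX2" variant is only a plain C++ stand-in so the sketch compiles without special flags; a real one would use AVX2 intrinsics or assembly.

```cpp
#include <cstddef>

using SumFn = long long (*)(const int*, std::size_t);

// Stand-in for the baseline build of the routine.
long long sum_baseline(const int* p, std::size_t n) {
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i) total += p[i];
    return total;
}

// Stand-in for a variant tuned for newer CPUs (plain C++ here, unrolled by four).
long long sum_tuned_standin(const int* p, std::size_t n) {
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    for (; i < n; ++i) s0 += p[i];
    return s0 + s1 + s2 + s3;
}

// Pick an implementation once, based on what the running CPU reports.
SumFn pick_sum() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_tuned_standin;
#endif
    return sum_baseline;
}

// Public entry point: the choice is made on the first call and then reused.
long long sum(const int* p, std::size_t n) {
    static const SumFn impl = pick_sum();
    return impl(p, n);
}
```

Callers only ever see `sum`; which variant runs is decided at run time on the user's machine, so a single shipped binary covers both kinds of CPU.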

At my work, I used assembly on an embedded target (microcontroller) for low level access.
But for PC software, I don't think it is very useful.

I have an example of assembly optimization I've done, but again it's on an embedded target. You can see some examples of assembly programming for PCs too, and it creates really small and fast programs, but usually not worth the effort (Look for "assembly for windows", you can find some very small and pretty programs).
My example was when I was writing a printer controller, and there was a function that was supposed to be called every 50 microseconds. It had to do reshuffling of bits, more or less. Using C I was able to do it in about 35 microseconds, and with assembly I did it in about 8 microseconds. It's a very specific procedure but still, something real and necessary.

On some embedded devices (phones and PDAs), it's useful because the compilers are not terribly mature, and can generate extremely slow and even incorrect code. I have personally had to work around, or write assembly code to fix, the buggy output of several different compilers for ARM-based embedded platforms.

Yes. Use either inline assembly or link assembly object modules. Which method you should use depends on how much assembly code you need to write. Usually it's OK to use inline assembly for a couple of lines and switch to separate object modules once it's more than one function.
Definitely, but sometimes it's necessary. The prominent example here would be programming an operating system.
Most compilers today optimize the code you write in a high-level language much better than anyone could ever write assembly code. People mostly use it to write code that would otherwise be impossible to write in a high-level language like C. If someone uses it for anything else, it means he is either better at optimization than a modern compiler (I doubt that) or just plain stupid, e.g. he doesn't know what compiler flags or function attributes to use.

use this:
__asm__ __volatile__(/*assembly code goes here*/);
the __asm__ can also just be asm.
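The __volatile__ stops the compiler from optimizing the asm statement itself away (for example deleting it because its outputs look unused, or hoisting it out of a loop); it does not switch off optimization of the surrounding code.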
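For a slightly fuller picture, here is a sketch of GCC/Clang extended inline assembly with operands, assuming an x86-64 target and the default AT&T syntax. `add_u32` is a hypothetical name; on any other compiler or architecture the function simply falls back to C++.

```cpp
#include <cstdint>

// Hypothetical helper: returns a + b (wrapping), via inline asm where supported.
std::uint32_t add_u32(std::uint32_t a, std::uint32_t b) {
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
    std::uint32_t result = a;
    __asm__ __volatile__(
        "addl %1, %0"    // AT&T operand order: result += b
        : "+r"(result)   // %0: register operand that is both read and written
        : "r"(b)         // %1: input-only register operand
        : "cc");         // clobber list: the add changes the condition flags
    return result;
#else
    return a + b;        // portable fallback
#endif
}
```

The operand constraints are what let the compiler pick registers and keep optimizing around the statement. The `__volatile__` is kept only to mirror the snippet above; since the output is used here, the statement would not be removed anyway.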

How is Assembly used in the modern day (with C/C++ for example)?

I understand how a computer works on the basic principles, such as, a program can be written in a "high" level language like C#, C and then it's broken down in to object code and then binary for the processor to understand. However, I really want to learn about assembly, and how it's used in modern day applications.
I know processors have different instruction sets above the basic x86 instruction set. Do all assembly languages support all instruction sets?
How many assembly languages are there? How many work well with other languages?
How would someone go about writing a routine in assembly, and then compiling it in to object/binary code?
How would someone then reference the functions/routines within that assembly code from a language like C or C++?
How do we know the code we've written in assembly is the fastest it possibly can be?
Are there any recommended books on assembly languages/using them with modern programs?
Sorry for the quantity of questions, I do hope they're general enough to be useful for other people as well as simple enough for others to answer!

However, I really want to learn about assembly, and how it's used in modern day applications.
On "normal" PCs it's used just for time-critical processing, I'd say that realtime multimedia processing can still benefit quite a bit from hand-forged assembly. On embedded systems, where there's a lot less horsepower, it may have more areas of use.
However, keep in mind that it's not just "hey, this code is slow, I'll rewrite it in assembly and by magic it will go fast": it must be carefully written assembly, written knowing what is fast and what is slow on your specific architecture, and keeping in mind all the intricacies of modern processors (branch mispredictions, out-of-order execution, ...). Often, the assembly written by a beginner-to-medium assembly programmer will be slower than the final machine code generated by a good, modern optimizing compiler. Performance stuff on x86 is often really complicated, and should be left to people who know what they are doing, and most of them are compiler writers. :) Have a look at this, for example. The question "C++ code for testing the Collatz conjecture faster than hand-written assembly - why?" gets into some of the specific x86 details for that case which you have to understand to match or beat a compiler with optimization enabled, for a single small loop.
I know processors have different instruction sets above the basic x86 instruction set. Do all assembly languages support all instruction sets?
I think you're confusing some things here. Many (=all modern) x86 processors support additional instructions and instruction sets that were introduced after the original x86 instruction set was defined. Actually, almost all x86 software now is compiled to exploit post-Pentium features like cmovcc; you can query the processor to see if it supports some features using the CPUID instruction. Obviously, if you want to use a mnemonic for some newer instruction set instruction your assembler (i.e. the software which translates mnemonics in actual machine code) must be aware of them.
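As a rough illustration of querying the processor, the sketch below reads CPUID leaf 1 through the `__get_cpuid` helper from `<cpuid.h>`, which GCC and Clang ship for x86 targets. The `CpuFeatures` struct and `query_cpu_features` are names invented for this example; on other compilers or architectures every flag is simply reported as false.

```cpp
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>  // GCC/Clang wrapper around the CPUID instruction
#define HAVE_CPUID_H 1
#else
#define HAVE_CPUID_H 0
#endif

// Hypothetical container for a few feature flags.
struct CpuFeatures {
    bool sse2;
    bool popcnt;
    bool avx;
};

CpuFeatures query_cpu_features() {
    CpuFeatures f{false, false, false};
#if HAVE_CPUID_H
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    // Leaf 1 returns the basic feature bits in ECX and EDX.
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.sse2   = ((edx >> 26) & 1u) != 0;  // EDX bit 26: SSE2
        f.popcnt = ((ecx >> 23) & 1u) != 0;  // ECX bit 23: POPCNT
        // ECX bit 28: the CPU implements AVX. Using AVX safely also needs
        // operating system support, which this sketch does not check.
        f.avx    = ((ecx >> 28) & 1u) != 0;
    }
#endif
    return f;
}
```

A program would call `query_cpu_features()` once at startup and branch on the result before using any code path that relies on the newer instructions.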
Most C compilers have intrinsics like _mm_popcnt_u32 and/or command line options like -mpopcnt to enable them that let you take advantage of new instructions without hand-written asm. x86 -mbmi / -mbmi2 extensions have several instructions that compilers know how to use when optimizing ordinary C like x << y (shlx instead of the more clunky shl) or x &= x-1; (blsr / _blsr_u32()). GCC has a -march=native option to enable all the instruction sets your CPU supports, and to set the -mtune= option to optimize for your CPU in terms of how much loop unrolling is a good idea, or which instructions or sequences are faster on one CPU, slower on another.
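The `x &= x - 1` idiom mentioned above can be sketched in ordinary C++ like this. The function names are hypothetical, and `__builtin_popcount` is a GCC/Clang builtin used only where available. Whether the compiler actually emits `blsr` or `popcnt` for this code depends on options such as `-mbmi` and `-mpopcnt` and on the compiler version, so check the generated assembly rather than taking it on faith.

```cpp
#include <cstdint>

// Clears the lowest set bit; this is the pattern compilers can map to blsr.
std::uint32_t clear_lowest_set_bit(std::uint32_t x) {
    return x & (x - 1);
}

// Counts set bits by repeatedly clearing the lowest one.
unsigned count_set_bits(std::uint32_t x) {
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;
        ++n;
    }
    return n;
}

// Same result through the compiler builtin, which can become a single popcnt.
unsigned count_set_bits_builtin(std::uint32_t x) {
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<unsigned>(__builtin_popcount(x));
#else
    return count_set_bits(x);  // portable fallback
#endif
}
```

Both counting functions return the same value for every input; the difference is only in how much of the work the compiler can hand to a dedicated instruction.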
If, instead, you're talking about other (non-x86) instruction sets for other families of processors, well, each assembler should support the instructions that the target processor can run. Not all the instructions of one assembly language have a direct replacement in others, and in general porting assembly code from one architecture to another is hard work.
How many assembly languages are there?
Theoretically, at least one dialect for each processor family. Keep in mind that there are also different notations for the same assembly language; for example, the following two instructions are the same x86 stuff written in AT&T and Intel notation:
mov $4, %eax // AT&T notation
mov eax, 4 // Intel notation
How would someone go about writing a routine in assembly, and then compiling it in to object/binary code?
If you want to embed a routine in an application written in another language, you should use the tools that the language provides you, in C/C++ you'd use the asm blocks.
You can instead make stand-alone .s or .asm files using the same syntax a C compiler would output, for example gcc -O3 -S will compile to a .s file that you can assemble with gcc -c. Separate files are a good idea if you want to write whole functions in asm instead of wrapping one or a couple instructions. A few open source projects like x264 and x265 (video encoders) have extensive amounts of NASM source code for different versions of functions for different versions of SSE or AVX available.
If you, instead, wanted to write a whole application in assembly, you'd have to write just in assembly, following the syntactic rules of the assembler you'd like to use.
How do we know the code we've written in assembly is the fastest it possibly can be?
In theory, because it is the nearest to the bare metal, so you can make the machine do exactly what you want, without having the compiler account for language features that in some specific case do not matter. In practice, since the machine is often much more complicated than what the assembly language exposes, as I said, hand-written assembly will often be slower than compiler-generated machine code, which takes into account many subtleties that the average programmer does not know.
Addendum
I was forgetting: knowing to read assembly, at least a little bit, can be very useful in debugging strange issues that can come up when the optimizer is broken/only in the release build/you have to deal with heisenbugs/when the source-level debugging is not available or other stuff like that; have a look at the comments here.

Intel and the x86 are big on reverse compatibility, which certainly helped them out but at the same time hurts greatly. The internals of the 8088/8086 to 286 to 386, to 486, pentium, pentium pro, etc to the present are somewhat of a redesign each time. Early on adding protection mechanisms for operating systems to protect apps from each other and the kernel, and then into performance by adding execution units, superscalar and all that comes with it, multi core processors, etc. What used to be a real, single AX register in the original processor turns into who knows how many different things in a modern processor. Originally your program was executed in the order written, today it is diced and sliced and executed in parallel in such a way that the intent of the instructions as presented are honored but the execution can be out of order and in parallel. Lots and lots of new tricks buried behind what on the surface appears to be a very old instruction set.
The instruction set changed from the 8/16 bit roots to 32 bit, to 64 bit, so the assembly language had to change as well. Adding EAX to AX, AH, and AL for example. Occasionally other instructions were added. But the original load, store, add, subtract, and, or, etc instructions are all there. I have not done x86 in a long time and was shocked to see that the syntax has changed and/or a particular assembler messed up the x86 syntax. There are a zillion tools out there so if one doesn't match the book or web page you are using, there is one out there that will.
So thinking in terms of assembly language for this family is right and wrong, the assembly language may have changed syntax and is not necessarily reverse compatible, but the instruction set or machine language or other similar terms (the opcodes/bits the assembly represents) would say that much of the original instruction set is still supported on modern x86 processors. 286 specific nuances may not work perhaps, as with other new features of specific generations, but the core instructions, load, store, add, subtract, push, pop, etc all still work and will continue to work. I feel it is better to "Drive down the center of the lane", don't get into chip or tool specific gee-whiz features, use the basic boring, been working since the beginning of time syntax of the language.
Because each generation in the family is trying for certain features, usually performance, the way the individual instructions are handed out to the various execution units changes...on each generation...In order to hand tune assembler for performance, trying to out-do a compiler, can be difficult at best. You need detailed knowledge about the specific processor you are tuning for. From the early x86 days to the present, unfortunately, what made the code execute faster on one chip, would often cause the next generation to run extra slow. Perhaps that was a marketing tool in disguise, not sure, "Buy the hot new processor that cost twice as much as the one you have now, advertises twice the clock speed, but runs your same copy of windows 30% slower. In a few years when the next version of windows is compiled (and this chip is obsolete) it will then double in performance". Another side effect of this is that at this point in time you cannot take one C program and create one binary that runs fast on all x86 processors, for performance you need to tune for the specific processor, meaning you need to at least tell the compiler to optimize and what family to optimize for. And like windows or office, or something you are distributing as a binary you likely cannot or do not want to somehow bury several differently tuned copies of the same program in one package or in one binary...drive down the center of the road.
As a result of all the hardware improvements it may be in your best interest to not try to tune the compiler output or hand assembler to any one chip in particular. On average the hardware improvements will compensate for the lack of compiler tuning and your same program hopefully just runs a little faster each generation. One of the chip vendors used to aim to make today's popular compiled binaries run faster tomorrow, the other vendor improved the internals such that if you recompiled today's source for the new internals you could run faster tomorrow. Those activities between vendors have not necessarily continued, each generation runs today's binaries slower, but tomorrow's recompiled source the same speed or slower. It will run tomorrow's re-written programs faster, sometimes with the same compiler, sometimes you need tomorrow's compiler. Isn't this fun!
So how do we know a particular compiled or hand assembled program is as fast as it possibly can be? We don't, in fact for x86 you can guarantee it isn't, run it on one chip in the family and it is slow, run it on another it may be blazing fast. x86 or not, other than very short programs or very deterministic programs like you would find on a microcontroller, you cannot definitely say this is the fastest possible solution. Caches for example are very hard if even possible to tune for, and the memory behind them, particularly on a pc, where the user can choose various sizes, speeds, ranks, banks, etc and adjust bios settings to change even more settings, you really cannot tell a compiler to tune for that. So even on the same computer same processor same compiled binary you have the ability to turn some of the knobs and make that program run a lot faster or a lot slower. Change processor families, change chipsets, motherboards, etc. And there is no possible way to tune for so many variables. The nature of the x86 pc business has become too chaotic.
Other chip families are not nearly as problematic. Some perhaps but not all. So these are not general statements, but specific to the x86 chip family. The x86 family is the exception not the rule. Probably the last assembler/instruction set you would want to bother learning.
There are tons of websites and books on the subject, cannot say one is better than the other. I learned from the original set of 8088/86 books from Intel and then the 386 and 486 book, didn't look for Intel books after that (or any other books). You will want an instruction set reference, and an assembler like nasm or gas (gnu assembler, part of binutils that comes with most gcc based compiler toolchains). As far as the C to/from assembler interface you can if nothing else figure that out by experimenting, write a small C program with a few small C functions, disassemble or compile to assembler, and look at what registers and/or how the stack is used to pass parameters between functions. Keep your functions simple and use only a few parameters and your assembler will likely work just fine. If not look at the assembler of the function calling your code and figure out where your parameters are. It is all well documented somewhere, and these days probably much better than in the old days. In the early 8088/86 days you had tiny, small, medium, large and huge compiler models and the calling conventions could vary from one to the other. As well as one compiler to the next, watcom (formerly zortech and perhaps other names) was pass by register, borland and microsoft were passed on the stack and pretty close if not the same. Now with 32 and 64 bit flat memory space, and standards, you can use one model and not have to memorize all the nuances (just one set of nuances). Inline assembly is an option but varies from C compiler to C compiler, and getting it to work properly and effectively is more difficult than just writing assembler in its own file. gcc and perhaps other compilers will allow you to put the assembler file on the C compiler command line as if it were just another C file and it will figure out what you have given it and pass it to the assembler for you. That is if you don't want to call the assembler program yourself and put the object on the C compiler command line.
If nothing else disassemble a lot of simple functions, add a few parameters and return them, etc. Change compiler optimization settings and see how that changes the instructions used, often dramatically. Even if you cannot write assembler from scratch being able to read it is very valuable, both from a debugging and performance perspective.
Not all compilers for all processors are good. Gcc for example is a one size fits all, just like a sock or ball cap where one size doesn't really fit anyone well. Does pretty good for most of the targets but not really great. So it is quite possible to do better than the compiler with hand tuned assembler, but on the average for lots of code you are not going to win. That applies to most processors, which are more deterministic, not just the x86 family. It is not about fewer instructions, fewer instructions does not necessarily equate to faster, to outperform even an average compiler in the long run you have to understand the caches, fetch, decode, execution state machines, memory interfaces, memories themselves, etc. With compiler optimizations turned off it is very easy to produce faster code than the compiler, so you should just use the optimizer but also understand that that increases the risk of the compiler making a mistake. You need to know the tool very well, which goes back to disassembling often to understand how your C code and the compiler you use today interact with each other. No compiler is completely standards compliant, because the standards themselves are fuzzy, leaving some features of the language up to the discretion of the compiler (drive down the middle of the road and don't use those parts of the language).
Bottom line from the nature of your questions, I would recommend writing a bunch of small functions or programs with some small functions, compile to assembler or compile to an object and disassemble to see what the compiler does. Be sure to use different optimization settings on each program. Gain a working reading knowledge of the instruction set (granted the asm output of the compiler or disassembler, has a lot of extra fluff that gets in the way of readability, you have to look past that, you need almost none of it if you want to write assembler). Give yourself 5-20 years of studying and experimenting before you can expect to outperform the compiler on a regular basis, if that is your goal. By then you will learn that, particularly with this chip family, it is a futile effort, you win a few but mostly lose...It would be to your benefit to compile (to assembler) the same code to other chip families like arm and mips, and get a general feel for what C code compiles well in general, and what C code doesnt compile well, and make your C programming better instead of trying to make the assembler better. Also try other compilers like llvm. Gcc has a lot of quirks that many think are the C language standards but are instead nuances or problems with the specific compiler. Being able to read and analyze the assembly output of the compilers and their options will provide this knowledge. So I recommend you work on a reading knowledge of the instruction set, without necessarily having to learn to write it from scratch.
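A starting point for that exercise might look like the sketch below. The file name `inspect.cpp` and the function names are made up for this example; the `-S` option (emit assembly instead of an object file) and the `-O0`/`-O2` levels are standard GCC options.

```cpp
// inspect.cpp (hypothetical file name)
//
// Emit assembly at two optimization levels and compare the results:
//   g++ -O0 -S inspect.cpp -o inspect_O0.s
//   g++ -O2 -S inspect.cpp -o inspect_O2.s
//
// extern "C" turns off C++ name mangling, so the labels in the .s files
// match the function names below and are easier to find.

extern "C" int add_three(int a, int b, int c) {
    return a + b + c;
}

extern "C" int sum_array(const int* values, int count) {
    int total = 0;
    for (int i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}

extern "C" int max_of(int a, int b) {
    return a > b ? a : b;
}
```

Looking at where the arguments arrive and where the result is left answers most calling-convention questions for your platform, and comparing the two `.s` files shows how much the optimizer changes even trivial code.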

You need to look at it from the hardware's point of view: an assembly language is designed around what the CPU can do. Every time a new feature is added to a CPU, a corresponding assembly instruction is created so that the feature can be used.
Assembly is thus very dependent on the CPU. High-level languages like C++ provide abstractions over this so that we do not have to think about details like CPU instructions, and the compiler generates optimized assembly code for us.
EDIT:
How many assembly languages are there? How many work well with other languages?
As many as there are different types of CPU. I didn't understand the second question. Assembly per se does not interact with any other language; its output, the machine code, does.
How would someone go about writing a routine in assembly, and then compiling it in to object/binary code?
The principle is similar to writing in any other compiled language: you create a text file with the assembly instructions and use an assembler to translate it to machine code. Then you link it with any runtime libraries it needs.
How would someone then reference the functions/routines within that assembly code from a language like C or C++?
Most C and C++ compilers provide inline assembly, so there is no need to link. If you do want to link, you need to write the assembly object so that it follows the same calling conventions as the host language. For instance, some languages push a function's arguments on the stack in a certain order when calling it, so you would have to do the same.
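A minimal sketch of both routes, assuming GCC or Clang targeting x86 or x86-64 (the inline assembly syntax is a compiler extension and differs on other compilers). The names `add_ints` and `asm_add` are hypothetical; on any other compiler or CPU the function falls back to plain C++ so that it still builds.

```cpp
#include <cassert>

// Adds two integers using GNU extended inline assembly where available.
int add_ints(int a, int b) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    int result = a;
    // AT&T operand order: addl source, destination.
    // %0 is result (read and written), %1 is b (read only).
    __asm__("addl %1, %0"
            : "+r"(result)   // output operand, also used as input
            : "r"(b)         // input operand
            : "cc");         // the instruction modifies the flags
    return result;
#else
    return a + b;            // fallback for other compilers and CPUs
#endif
}

// The linking route: a routine assembled separately into an object file
// would be declared like this. extern "C" turns off C++ name mangling so
// the symbol name matches the label in the assembly source. The routine
// itself is hypothetical and is never called here.
extern "C" int asm_add(int a, int b);
```

The separately assembled routine would have to read its arguments and return its result in the places the platform's calling convention dictates, which is the ordering issue described above.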
How do we know the code we've written in assembly is the fastest it possibly can be?
Because it is closest to the actual hardware: you decide exactly which instructions run. When you are dealing with higher level languages you don't know what the compiler will do with your for loop. However, more often than not compilers do a better job of optimizing the code than a human can (of course, in very special circumstances you can probably get a better result).

There are many, many different assembly languages out there. Usually there is at least one for every processor instruction set, which means one for every processor type. Keep in mind that even for a single processor there may be several different assembler programs with different syntaxes, which from a formal point of view constitute different languages. (For x86 there are MASM, NASM, YASM, the AT&T syntax that *nix assemblers like the GNU assembler use by default, and probably many more.)
For x86 there are lots of different instruction sets because there have been so many changes to the architecture over the years. Some of these changes could be viewed mostly as additional instructions, so the result is a superset of the previous instruction set. Other changes may actually remove instructions (none are coming to mind for x86, but I've heard of some on other processors). And other changes add modes of operation to processors, which makes things even more complicated.
There are also other processors with completely different instructions.
To learn assembly you will need to start by picking a target processor and an assembler that you want to use. I'm going to assume that you are going to use x86, so you would need to decide whether to start with 16 bit segmented, 32 bit, or 64 bit. Many books and online tutorials go the 16 bit route, where you write DOS programs. If you want to write parts of C programs in assembly then you will probably want to go the 32 or 64 bit route.
Most of the assembly programming I do is inline in C to either optimize something, to make use of instructions that the compiler doesn't know about, or when I otherwise need to control the instructions used. Writing large amounts of code in assembly is difficult, so I let the C compiler do most of the work.
There are lots of places where assembly is still written by people. This is particularly common in embedded code, boot loaders (BIOS, U-Boot, ...), and operating system code, though many developers in these areas never directly write any assembly. This code may be start-up code that has to run before the stack pointer is set to a usable value (or while RAM isn't usable yet for some other reason), code that needs to fit within a small space, and/or code that needs to talk to hardware in ways that aren't directly supported in C or other higher level languages. Other places where assembly is used in OSes are locks (spinlocks, critical sections, mutexes, and semaphores) and context switching (switching from one thread of execution to another).
Another place where assembly is commonly written is in the implementation of some library code. Functions like strcpy are often implemented in assembly for different architectures because there are often several ways to optimize them using processor-specific operations, while a C implementation might use a more general loop. These functions are also reused so often that optimizing them by hand is often worth the effort in the long run.
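For comparison, the "more general loop" might look like the sketch below. It is not taken from any particular library; `copy_string` is a hypothetical name chosen to avoid clashing with the real strcpy. It copies one byte at a time, whereas a hand-written assembly version would typically move several bytes per instruction using whatever wide or string operations the target CPU offers.

```cpp
#include <cassert>

// Portable byte-at-a-time string copy (illustrative, not a library source).
// Copies src, including its terminating '\0', into dest and returns dest.
// The caller must make sure dest is large enough and the buffers do not overlap.
char* copy_string(char* dest, const char* src) {
    assert(dest != nullptr && src != nullptr);
    char* out = dest;
    // Copy each character; stop after the terminator has been copied.
    while ((*out++ = *src++) != '\0') {
    }
    return dest;
}
```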
Another, related, place where lots of assembly is written is within compilers. Compilers have to know how to implement things and many of them produce assembly, so they have assembly templates (or something similar) built into them for use in generating output code.
Even if you never write any assembly, knowing the instructions and registers of your target system is often useful. It can aid in debugging, but it can also aid in writing code. Knowing the target processor can help you write better (smaller and/or faster) code for it, even in a higher level language, and being familiar with a few different processors will help you write code that is good for many processors, because you will know generally how CPUs work.

We do a fair bit of it in our real-time work (more than we should, really). A wee bit of assembly can also be quite useful when you are talking to hardware and need specific machine instructions executed (e.g. all writes must be 16-bit writes, or you'll hose nearby registers).
What I tend to see today is assembly inserted into higher-level language code. How exactly this is done depends on your language and sometimes your compiler.
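To make the 16-bit example concrete, this is roughly what the high-level version of such a register access looks like. It is a sketch with hypothetical names (`write_reg16`, `read_reg16`). A volatile 16-bit access usually compiles to a single 16-bit store or load, but the language does not promise which instruction is emitted, which is exactly why people check the disassembly or drop to assembly when the hardware is picky.

```cpp
#include <cassert>
#include <cstdint>

// Store a 16-bit value through a volatile pointer. volatile stops the
// compiler from removing or merging the access; it does not by itself
// guarantee the width of the store instruction.
inline void write_reg16(volatile std::uint16_t* reg, std::uint16_t value) {
    *reg = value;
}

// Load a 16-bit value through a volatile pointer.
inline std::uint16_t read_reg16(const volatile std::uint16_t* reg) {
    return *reg;
}
```

On real hardware the pointer would come from the device's documented register address; an ordinary variable can stand in for the register when trying this out on a desktop machine.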

I know processors have different instruction sets above the basic x86 instruction set. Do all assembly languages support all instruction sets?
"Assembly language" is a kind of misnomer, at least in the way you are using it. Assemblers are less of a language (CS graduates may object) and more of a converter tool which takes a textual representation and generates a binary image from it, with a close to 1:1 relationship between text elements (mnemonics, labels and numbers) and binary elements. There is no deeper logic behind the elements of an assembler language, because the ways in which they can be referenced and redirected mostly end at level 1. You can, for example, use EAX only in one instruction at a time - the next use of EAX in the next instruction bears no relationship to its previous use EXCEPT for the unwritten logical connection which the programmer had in mind - and this is the reason why it is so easy to create bugs in assembler.
How would someone go about writing a routine in assembly, and then compiling it in to object/binary code?
One would need to pin down the lowest common denominator of instruction sets and code the function once for each architecture the code is intended to run on. IOW, if you are not coding for a specific hardware platform that is defined at the time of writing (e.g. a game console or an embedded board), you no longer do this.
How would someone then reference the functions/routines within that assembly code from a language like C or C++?
You need to declare them in your HLL - see your compiler's handbook.
How do we know the code we've written in assembly is the fastest it possibly can be?
There is no way to know. Be happy about that and code on.
