Can I control what gets copied into CPU cache in C++?


I read about cache optimization in C++ and the mechanisms modern CPUs use to predict what data is needed next and copy it into cache. But is there a direct way in C++ for programmers, who know what is actually needed next, to determine what data gets copied into the CPU cache?

This varies with the processor and compiler you're using.
Assuming you're using an Intel x86/x64 or compatible (e.g., AMD) processor, the processor provides a number of prefetch instructions, and most compilers include intrinsics to invoke them. With VC++ you use _mm_prefetch (the older _m_prefetch and _m_prefetchw intrinsics also exist). With gcc you use __builtin_prefetch.
Likewise, VC++ on ARM provides a __prefetch intrinsic for the same purpose (no, I really don't know why they couldn't have used the same name as on x86; the signature and effect appear identical).
Most other reasonably modern, higher-end processors probably provide similar instructions, and I'd guess most compilers provide intrinsics to make them available, but just as with these, the names of the intrinsics will vary. For that matter, even though the functions are intrinsic to the compiler, most require that you include some header to use them -- and the name of the header will also vary.

The prefetch intrinsics Jerry provided would do the trick. Keep in mind that there are several flavors, controlled by an argument to the function, that determine which levels of the cache (if any) will keep the line. A non-temporal prefetch (the NTA hint), for example, is meant to avoid polluting the caches and to provide the line only for immediate use; it is for cases where you're going to use the data soon and only once.
Also keep in mind that these instructions are basically hints to the CPU (which also does quite well by itself at guessing which lines to prefetch). As such, they are not guaranteed to work, and they may be dropped in many cases (if the memory subsystem is loaded, or the address has been swapped out of memory).
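As a rough illustration of the gcc intrinsic mentioned above, here is a minimal sketch that sums array elements reached through an index table, an access pattern the hardware prefetcher may not predict well. The function name sum_indexed and the PREFETCH_DISTANCE value are assumptions made for this example; a useful distance has to be found by measurement, and the hint may not help at all on a given machine.

```c
#include <stddef.h>

/* Fall back to a no-op on compilers without the GCC/Clang builtin. */
#if defined(__GNUC__) || defined(__clang__)
/* Arguments: address, 0 = prefetch for reading, 3 = keep in all cache levels. */
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

/* Assumed look-ahead distance; tune by profiling. */
#define PREFETCH_DISTANCE 8

/* Sums data[idx[0]] + ... + data[idx[n-1]], hinting at elements needed soon. */
long sum_indexed(const long *data, const size_t *idx, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Only a hint: the CPU is free to ignore it. */
            PREFETCH_READ(&data[idx[i + PREFETCH_DISTANCE]]);
        }
        total += data[idx[i]];
    }
    return total;
}
```

Passing 0 as the last argument of __builtin_prefetch instead of 3 asks for a low-temporal-locality prefetch, which is the closest gcc equivalent to the NTA flavor described above.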

Related

Choose assembly implementation to use based on supported instructions

I am working on a C library which compiles/links to a .a file that users can statically link into their code. The library's performance is very important, so I am writing performance-critical routines in x86-64 assembly to optimize performance.
For some routines, I can get significantly better performance if I use BMI2 instructions than if I stick to the "standard" x86-64 instruction set. Trouble is, BMI2 was introduced fairly recently and some of my users use processors that do not support those instructions.
So, I've written the optimized routines twice, once using BMI2 instructions and once without using them. In my current setup, I would distribute two versions of the .a file: a "fast" one that requires support for BMI2 instructions, and a "slow" one that does not require support for BMI2 instructions.
I am asking if there's a way to simplify this by distributing a single .a file that will dynamically choose the correct implementation based on whether the CPU on which the final application runs supports BMI2 instructions.
Unlike similar questions on StackOverflow, there are two peculiarities here:
The technique to choose the function needs to have particularly low overhead in the critical path. The routines in question, after assembly-optimization, run in ~10 ns, so even a single if statement could be significant.
The function that needs to be chosen "dynamically" is chosen once at the beginning, and then remains fixed for the duration of the program. I'm hoping that this will offer a faster solution than the one suggested in this question: Choosing method implementation at runtime
The fastest solution I've come up with so far is to do the following:
Check whether the CPU supports BMI2 instructions using the cpuid instruction.
Set a global variable true or false depending on the result.
Branch on the value of this global variable on every function invocation.
I'm not satisfied with this approach because it has two drawbacks:
I'm not sure how I can automatically run cpuid and set a global variable at the beginning of the program, given that I'm distributing a .a file and don't have control over the main function in the final binary. I'm happy to use C++ here if it offers a better solution, as long as the final library can still be linked with and called from a C program.
This incurs overhead on every function call, when ideally the only overhead would be on program startup.
Are there any solutions that are more efficient than the one I've detailed above?

x264 uses an init function (which users of the library are required to call before calling anything else, or something like that) to set up a struct of function pointers based on CPUID results. This includes taking into account that pshufb is slow on some early CPUs that support it.
If your functions depend on pdep / pext, you probably want to detect AMD vs. Intel, because AMD's pdep/pext is very slow on Ryzen CPUs before Zen 3 and probably not worth using there, even though it is available. (See https://agner.org/optimize/ for instruction tables.)
Function pointers are fairly low overhead, about the same as calling a function in a shared library or DLL: call [rel funcptr] instead of call func (in the compiler-generated asm that calls your functions).
The question "CPU dependent code: how to avoid function pointers?" shows a very simple example of it in C, and asks for ways to avoid it. With dynamic linking, you can do CPU detection at dynamic link time so the dynamic-linking indirection becomes your CPU-dispatch indirection as well (like glibc does for selecting an optimized memcpy implementation).
But with static linking for a .a, just make function pointers that are statically initialized to the baseline versions, and your CPU init function (which hopefully runs before any of the function pointers are dereferenced) rewrites them to point at the best version for the current CPU.
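The static-linking approach described above could be sketched roughly as follows. All names here (mylib_init, mylib_clear_lowest_bit and the two implementations) are hypothetical, and both implementations are plain C stand-ins for the hand-written assembly routines. The sketch assumes GCC or Clang on x86 for __builtin_cpu_supports; on other targets the pointer simply keeps its baseline value. The constructor attribute is one possible way to run the init code without controlling main, again as a GCC/Clang extension.

```c
#include <stdint.h>

/* Stand-in for the baseline routine (would be assembly in the real library). */
static uint64_t clear_lowest_bit_baseline(uint64_t x) { return x & (x - 1); }

/* Stand-in for the BMI2-optimized routine; written in plain C here so the
   sketch runs anywhere. */
static uint64_t clear_lowest_bit_fast(uint64_t x) { return x & (x - 1); }

/* Statically initialized to the baseline, so a call made before init runs
   is still valid. */
uint64_t (*mylib_clear_lowest_bit)(uint64_t) = clear_lowest_bit_baseline;

/* Rewrites the pointer once, based on what the running CPU supports. */
void mylib_init(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("bmi2")) {
        mylib_clear_lowest_bit = clear_lowest_bit_fast;
    }
#endif
}

#if defined(__GNUC__) || defined(__clang__)
/* Optional: run the init automatically before main(). */
__attribute__((constructor)) static void mylib_auto_init(void) { mylib_init(); }
#endif
```

Callers then write mylib_clear_lowest_bit(x), which compiles to an indirect call through the pointer; the only per-call cost is that indirection.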

If you are using gcc, you can get the compiler to generate all the boilerplate code automatically. See the gcc manual page on function multiversioning.

Backward compatibility of the code compiled optimized for new instruction set extensions

In order to narrow the scope of this question, let's consider projects in C / C++ only.
There is a whole array of new SIMD instruction set extensions for x86 architecture, though in order to benefit from them a developer should recompile the code with an appropriate optimization flag, and perhaps, modify it accordingly as well.
Since new instruction set extensions come out relatively frequently, it's unclear how the backward compatibility can be maintained while utilizing the benefits of available instruction set extensions.
Does the resulting application stay compatible with older CPU models that don't support a new instruction set extension? If yes, could you elaborate on how such support is implemented?

New CPU instructions require new hardware to execute. If you try to run them on older CPUs that don't support those instructions, your program will crash with an Invalid Opcode fault. Occasionally OSes will handle this condition, but usually not.
To run with the new instructions, you either need to require that they are supported in hardware, or (if the benefit is great enough) check at runtime to see if the new instructions you need are supported. If they are, you run a section of code that uses them. If they are not, you run a different section of code that does not use them.
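A minimal sketch of the runtime check described above, assuming GCC or Clang on x86, where the cpuid.h header provides __get_cpuid_count. It tests the BMI2 feature bit (CPUID leaf 7, subleaf 0, EBX bit 8) and then picks one of two code paths. The function names cpu_has_bmi2 and low_bits are made up for this example, and low_bits assumes n is at most 32.

```c
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#include <immintrin.h>
#define X86_DISPATCH 1

/* Compiled with BMI2 enabled for this one function only. */
__attribute__((target("bmi2")))
static uint32_t low_bits_bmi2(uint32_t x, unsigned n)
{
    return _bzhi_u32(x, n); /* zero all bits from position n upward */
}
#endif

/* Portable path that uses no new instructions. */
static uint32_t low_bits_portable(uint32_t x, unsigned n)
{
    return n >= 32 ? x : (x & ((1u << n) - 1u));
}

/* Returns 1 if CPUID reports BMI2 support, 0 otherwise. */
int cpu_has_bmi2(void)
{
#ifdef X86_DISPATCH
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        return (int)((ebx >> 8) & 1u);
    }
#endif
    return 0;
}

/* Keeps the low n bits of x (n <= 32), choosing the code path at run time. */
uint32_t low_bits(uint32_t x, unsigned n)
{
#ifdef X86_DISPATCH
    if (cpu_has_bmi2()) {
        return low_bits_bmi2(x, n);
    }
#endif
    return low_bits_portable(x, n);
}
```

Extensions that add new register state, such as AVX, also need a check that the operating system saves that state; BMI2 works on general-purpose registers, so the feature bit alone is enough here. In real code the result of the check would normally be cached rather than re-queried on every call.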
Generally "backwards compatible" refers to a new version of something running stuff that runs on the older, existing things, and not old things running with new stuff.

Historically, most x86 instruction sets have been (practically) strict supersets of previous sets. However, the AVX-512 extension comes in several mutually-incompatible variants, so particular care will need to be taken.
Fortunately, compilers are also getting smarter. GCC has __attribute__((simd)) and __attribute__((target_clones(...))) to automatically create multiple implementations of the given function, and choose the best one at load time based on what the actual CPU supports. (For older GCC versions, you had to use IFUNC manually ... and in ancient days, ld.so would load libraries from a completely separate directory depending on things like cmov).
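As a hedged sketch of the target_clones attribute mentioned above: the function add_arrays and the MULTIVERSION macro are invented for this example, and the guard assumes GCC on x86-64 Linux with glibc, where the IFUNC mechanism is available. Elsewhere the macro expands to nothing and a single ordinary version is built.

```c
#include <stdlib.h>  /* also makes __GLIBC__ visible on glibc systems */
#include <stddef.h>

#if defined(__GNUC__) && !defined(__clang__) && \
    defined(__x86_64__) && defined(__GLIBC__)
/* Ask GCC to build an AVX2 clone and a baseline clone of the function,
   plus a resolver that picks one when the program is loaded. */
#define MULTIVERSION __attribute__((target_clones("avx2", "default")))
#else
#define MULTIVERSION
#endif

/* Element-wise sum; the loop is a candidate for auto-vectorization. */
MULTIVERSION
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

The caller just calls add_arrays; no CPU-detection code appears in the source. Whether the AVX2 clone is actually faster depends on the optimization level and on the loop being vectorized.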

Compiler optimization of functions parameters

Function parameters are placed on the stack, but compilers can optimize this task by the use of optional registers. It would make sense that this optimization will kick in if there are only 1-2 parameters, and not when there are 256 (not that one would want to have the max number of parameters).
How can one find out the parameter limit (number of parameters) for a certain compiler (such as gcc) where one can be sure that this optimization will be used?

Function parameters are placed on the stack, but compilers can optimize this task by the use of optional registers.
As FrankH says in his comments and as I'm going to say in my answer, the application binary interface for the system in question determines how arguments are passed to functions - this is called the calling convention for that platform.
To complicate matters, 32-bit x86 actually has several. This is historical and comes from the fact that when 32-bit Windows arrived, everyone went crazy doing different things.
So, yes, you can "optimise" by writing function calls in such a way, but no, you shouldn't. You should follow the standards for your platform. Because the honest truth is, the speed of stack access probably isn't slowing your code down to that great an extent that you need to be binary-incompatible from everyone else on your system.
Why the need for ABIs/standard calling conventions? Well, in terms of using the processor registers, stack, etc., applications must agree on what means what and where it should go. If one function decided all its arguments were in registers and another that some were on the stack, how would they be interoperable? Moreover, you might come across the term scratch registers to mean those registers you don't have to restore. What happens if you call a function expecting it to leave some registers alone?
Anyway, as for what you asked for, here's some ABI documentation:
The difference between x86 and x64 on windows.
x86_64 ABI used for Unix-like platforms.
Wikipedia's x86 calling conventions.
A document on compiler calling conventions.
The last one is my favourite. To quote it:
In the days of the old DOS operating system, it was often possible to combine development
tools from different vendors with few compatibility problems. With 32-bit Windows, the
situation has gone completely out of hand. Different compilers use different data
representations, different function calling conventions, and different object file formats.
While static link libraries have traditionally been considered compiler-specific, the
widespread use of dynamic link libraries (DLL's) has made the distribution of function
libraries in binary form more common.
So whatever you're trying to do with optimising via modifying the function calling method, don't. Find another way to optimise. Profile your code. Study the optimisation options available for your compiler (-OX) if you think it helps, and dump the assembly to check if the speed is really that crucial.

For publicly visible functions, this is documented in the ABI standard. For functions that are not referenceable from the outside, all bets are off anyway.
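To make the kind of rule an ABI document states concrete, here is a small sketch. The function sum8 is invented for illustration. Under the System V x86-64 ABI used on Unix-like platforms, the first six integer arguments travel in registers and the rest go on the stack; the Windows x64 convention uses four registers instead. Other platforms differ, so the comments describe those two conventions only.

```c
#include <stdint.h>

/* System V x86-64: a..f arrive in rdi, rsi, rdx, rcx, r8, r9;
   g and h are passed on the stack.
   Windows x64: a..d arrive in rcx, rdx, r8, r9; e..h are on the stack. */
int64_t sum8(int64_t a, int64_t b, int64_t c, int64_t d,
             int64_t e, int64_t f, int64_t g, int64_t h)
{
    return a + b + c + d + e + f + g + h;
}
```

Compiling this with gcc -O2 -S and reading the output shows which arguments are read from registers and which from the stack on your own platform.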

You would have to read the fine manual for the compiler. If you were lucky, you would find it there in a description of function calling conventions. Otherwise, for an OSS compiler such as gcc you would probably have to read its source-code.

How is Assembly used in the modern day (with C/C++ for example)?

I understand how a computer works on the basic principles, such as, a program can be written in a "high" level language like C#, C and then it's broken down in to object code and then binary for the processor to understand. However, I really want to learn about assembly, and how it's used in modern day applications.
I know processors have different instruction sets above the basic x86 instruction set. Do all assembly languages support all instruction sets?
How many assembly languages are there? How many work well with other languages?
How would someone go about writing a routine in assembly, and then compiling it in to object/binary code?
How would someone then reference the functions/routines within that assembly code from a language like C or C++?
How do we know the code we've written in assembly is the fastest it possibly can be?
Are there any recommended books on assembly languages/using them with modern programs?
Sorry for the quantity of questions, I do hope they're general enough to be useful for other people as well as simple enough for others to answer!

However, I really want to learn about assembly, and how it's used in modern day applications.
On "normal" PCs it's used just for time-critical processing; I'd say that realtime multimedia processing can still benefit quite a bit from hand-forged assembly. On embedded systems, where there's a lot less horsepower, it may have more areas of use.
However, keep in mind that it's not just "hey, this code is slow, I'll rewrite it in assembly and by magic it will go fast": it must be carefully written assembly, written knowing what is fast and what is slow on your specific architecture, and keeping in mind all the intricacies of modern processors (branch mispredictions, out-of-order execution, ...). Often, the assembly written by a beginner-to-medium assembly programmer will be slower than the final machine code generated by a good, modern optimizing compiler. Performance work on x86 is often really complicated, and should be left to people who know what they are doing, and most of them are compiler writers. :) For example, the question "C++ code for testing the Collatz conjecture faster than hand-written assembly - why?" gets into some of the specific x86 details that you have to understand to match or beat a compiler with optimization enabled, for a single small loop.
I know processors have different instruction sets above the basic x86 instruction set. Do all assembly languages support all instruction sets?
I think you're confusing some things here. Many (=all modern) x86 processors support additional instructions and instruction sets that were introduced after the original x86 instruction set was defined. Actually, almost all x86 software now is compiled to exploit post-Pentium features like cmovcc; you can query the processor to see if it supports some features using the CPUID instruction. Obviously, if you want to use a mnemonic for some newer instruction set instruction your assembler (i.e. the software which translates mnemonics in actual machine code) must be aware of them.
Most C compilers have intrinsics like _mm_popcnt_u32 and/or command line options like -mpopcnt to enable them that let you take advantage of new instructions without hand-written asm. x86 -mbmi / -mbmi2 extensions have several instructions that compilers know how to use when optimizing ordinary C like x << y (shlx instead of the more clunky shl) or x &= x-1; (blsr / _blsr_u32()). GCC has a -march=native option to enable all the instruction sets your CPU supports, and to set the -mtune= option to optimize for your CPU in terms of how much loop unrolling is a good idea, or which instructions or sequences are faster on one CPU, slower on another.
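As a small illustration of letting the compiler pick the instructions, the sketch below counts set bits in two ways. The function names are made up for this example. The first version is ordinary C built around the x &= x - 1 idiom mentioned above, which a compiler may turn into blsr when -mbmi is enabled; the second uses the GCC/Clang builtin __builtin_popcount, which may become a single popcnt instruction with -mpopcnt or -march=native. Which instructions are actually emitted depends on the compiler and flags.

```c
#include <stdint.h>

/* Plain C: clear the lowest set bit until none remain. */
unsigned count_set_bits(uint32_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1; /* clears the lowest set bit */
        ++n;
    }
    return n;
}

/* Same result through a compiler builtin where available. */
unsigned count_set_bits_builtin(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_popcount(x);
#else
    return count_set_bits(x);
#endif
}
```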
If, instead, you're talking about other (non-x86) instruction sets for other families of processors, well, each assembler should support the instructions that the target processor can run. Not all the instructions of one assembly language have a direct replacement in others, and porting assembly code from one architecture to another is usually hard work.
How many assembly languages are there?
Theoretically, at least one dialect for each processor family. Keep in mind that there are also different notations for the same assembly language; for example, the following two instructions are the same x86 stuff written in AT&T and Intel notation:
mov $4, %eax // AT&T notation
mov eax, 4 // Intel notation
How would someone go about writing a routine in assembly, and then compiling it in to object/binary code?
If you want to embed a routine in an application written in another language, you should use the tools that the language provides you, in C/C++ you'd use the asm blocks.
You can instead make stand-alone .s or .asm files using the same syntax a C compiler would output, for example gcc -O3 -S will compile to a .s file that you can assemble with gcc -c. Separate files are a good idea if you want to write whole functions in asm instead of wrapping one or a couple instructions. A few open source projects like x264 and x265 (video encoders) have extensive amounts of NASM source code for different versions of functions for different versions of SSE or AVX available.
If you, instead, wanted to write a whole application in assembly, you'd have to write just in assembly, following the syntactic rules of the assembler you'd like to use.
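For the asm-block route mentioned above, here is a minimal sketch using GCC/Clang extended inline assembly on x86-64; the function name add_u64 is invented for the example. Other compilers use a different inline-asm syntax (or none at all), so the code falls back to plain C there.

```c
#include <stdint.h>

/* Adds b to a with a single x86-64 add instruction (AT&T operand order:
   source first, destination second). */
uint64_t add_u64(uint64_t a, uint64_t b)
{
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
    __asm__("addq %1, %0"
            : "+r"(a)   /* %0: read and written, any general register */
            : "r"(b)    /* %1: input only */
            : "cc");    /* the add modifies the flags */
    return a;
#else
    return a + b;       /* portable fallback */
#endif
}
```

The constraint strings tell the compiler which registers the instruction reads and writes, which is what lets it schedule the surrounding C code safely.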
How do we know the code we've written in assembly is the fastest it possibly can be?
In theory, because it is the nearest to the bare metal, so you can make the machine do exactly what you want, without having the compiler account for language features that in some specific case do not matter. In practice, since the machine is often much more complicated than what the assembly language exposes, hand-written assembly will, as I said, often be slower than compiler-generated machine code, which takes into account many subtleties that the average programmer does not know.
Addendum
I was forgetting: knowing how to read assembly, at least a little bit, can be very useful when debugging strange issues, for example when the optimizer is broken, when a bug shows up only in the release build, when you have to deal with heisenbugs, or when source-level debugging is not available.

Intel and the x86 are big on reverse compatibility, which certainly helped them out but at the same time hurts greatly. The internals of the 8088/8086 to 286 to 386, to 486, pentium, pentium pro, etc to the present are somewhat of a redesign each time. Early on adding protection mechanisms for operating systems to protect apps from each other and the kernel, and then into performance by adding execution units, superscalar and all that comes with it, multi core processors, etc. What used to be a real, single AX register in the original processor turns into who knows how many different things in a modern processor. Originally your program was executed in the order written, today it is diced and sliced and executed in parallel in such a way that the intent of the instructions as presented are honored but the execution can be out of order and in parallel. Lots and lots of new tricks buried behind what on the surface appears to be a very old instruction set.
The instruction set changed from its 8/16-bit roots to 32-bit and then 64-bit, so the assembly language had to change as well: adding EAX alongside AX, AH, and AL, for example. Occasionally other instructions were added. But the original load, store, add, subtract, and, or, etc. instructions are all there. I have not done x86 in a long time and was shocked to see that the syntax has changed and/or a particular assembler messed up the x86 syntax. There are a zillion tools out there, so if one doesn't match the book or web page you are using, there is one out there that will.
So thinking in terms of one assembly language for this family is both right and wrong: the assembly language may have changed syntax and is not necessarily backward compatible, but in terms of the instruction set or machine language (the opcodes/bits the assembly represents), much of the original instruction set is still supported on modern x86 processors. 286-specific nuances may not work, as with other new features of specific generations, but the core instructions (load, store, add, subtract, push, pop, etc.) all still work and will continue to work. I feel it is better to "drive down the center of the lane": don't get into chip- or tool-specific gee-whiz features; use the basic, boring, been-working-since-the-beginning-of-time syntax of the language.
Because each generation in the family is trying for certain features, usually performance, the way the individual instructions are handed out to the various execution units changes...on each generation...In order to hand tune assembler for performance, trying to out-do a compiler, can be difficult at best. You need detailed knowledge about the specific processor you are tuning for. From the early x86 days to the present, unfortunately, what made the code execute faster on one chip, would often cause the next generation to run extra slow. Perhaps that was a marketing tool in disguise, not sure, "Buy the hot new processor that cost twice as much as the one you have now, advertises twice the clock speed, but runs your same copy of windows 30% slower. In a few years when the next version of windows is compiled (and this chip is obsolete) it will then double in performance". Another side effect of this is that at this point in time you cannot take one C program and create one binary that runs fast on all x86 processors, for performance you need to tune for the specific processor, meaning you need to at least tell the compiler to optimize and what family to optimize for. And like windows or office, or something you are distributing as a binary you likely cannot or do not want to somehow bury several differently tuned copies of the same program in one package or in one binary...drive down the center of the road.
As a result of all the hardware improvements it may be in your best interest not to try to tune the compiler output or hand-written assembler to any one chip in particular. On average the hardware improvements will compensate for the lack of compiler tuning, and your same program hopefully just runs a little faster each generation. One of the chip vendors used to aim to make today's popular compiled binaries run faster tomorrow; the other vendor improved the internals such that if you recompiled today's source for the new internals you could run faster tomorrow. Those activities between vendors have not necessarily continued: each generation runs today's binaries slower, but tomorrow's recompiled source at the same speed or slower. It will run tomorrow's rewritten programs faster, sometimes with the same compiler; sometimes you need tomorrow's compiler. Isn't this fun!
So how do we know a particular compiled or hand-assembled program is as fast as it possibly can be? We don't. In fact, for x86 you can guarantee it isn't: run it on one chip in the family and it is slow, run it on another and it may be blazing fast. x86 or not, other than for very short programs or very deterministic programs like you would find on a microcontroller, you cannot definitively say this is the fastest possible solution. Caches, for example, are very hard, if even possible, to tune for, and so is the memory behind them, particularly on a PC, where the user can choose various sizes, speeds, ranks, banks, etc. and adjust BIOS settings to change even more settings; you really cannot tell a compiler to tune for that. So even on the same computer, with the same processor and the same compiled binary, you have the ability to turn some of the knobs and make that program run a lot faster or a lot slower. Change processor families, chipsets, motherboards, etc., and there is no possible way to tune for so many variables. The nature of the x86 PC business has become too chaotic.
Other chip families are not nearly as problematic. Some perhaps but not all. So these are not general statements, but specific to the x86 chip family. The x86 family is the exception not the rule. Probably the last assembler/instruction set you would want to bother learning.
There are tons of websites and books on the subject; I cannot say one is better than the other. I learned from the original set of 8088/86 books from Intel and then the 386 and 486 books, and didn't look for Intel books after that (or any other books). You will want an instruction set reference, and an assembler like nasm or gas (the GNU assembler, part of binutils, which comes with most gcc-based compiler toolchains). As far as the C to/from assembler interface goes, you can, if nothing else, figure that out by experimenting: write a small C program with a few small C functions, disassemble or compile to assembler, and look at what registers are used and/or how the stack is used to pass parameters between functions. Keep your functions simple and use only a few parameters and your assembler will likely work just fine. If not, look at the assembler of the function calling your code and figure out where your parameters are. It is all well documented somewhere, and these days probably much better than of old. In the early 8088/86 days you had tiny, small, medium, large and huge compiler models, and the calling conventions could vary from one to the other, as well as from one compiler to the next: Watcom was pass by register, while Borland and Microsoft passed on the stack and were pretty close if not the same. Now with 32- and 64-bit flat memory spaces, and standards, you can use one model and not have to memorize all the nuances (just one set of nuances). Inline assembly is an option but varies from C compiler to C compiler, and getting it to work properly and effectively is more difficult than just writing assembler in its own file. gcc and perhaps other compilers will allow you to put the assembler file on the C compiler command line as if it were just another C file; it will figure out what you have given it and pass it to the assembler for you. That is, if you don't want to call the assembler program yourself and put the object on the C compiler command line.
If nothing else, disassemble a lot of simple functions, add a few parameters and return them, etc. Change compiler optimization settings and see how that changes the instructions used, often dramatically. Even if you cannot write assembler from scratch, being able to read it is very valuable, both from a debugging and a performance perspective.
Not all compilers for all processors are good. GCC, for example, is one-size-fits-all: like a sock or ball cap, one size doesn't really fit anyone well. It does pretty well for most targets but is not really great for any. So it is quite possible to do better than the compiler with hand-tuned assembler, but on average, for lots of code, you are not going to win. That applies to most processors, which are more deterministic, not just the x86 family. It is not about fewer instructions; fewer instructions does not necessarily equate to faster. To outperform even an average compiler in the long run you have to understand the caches, fetch, decode, execution state machines, memory interfaces, the memories themselves, etc. With compiler optimizations turned off it is very easy to produce faster code than the compiler, so you should just use the optimizer, but also understand that doing so increases the risk of the compiler making a mistake. You need to know the tool very well, which goes back to disassembling often to understand how your C code and the compiler you use today interact with each other. No compiler is completely standards compliant, because the standards themselves are fuzzy, leaving some features of the language up to the discretion of the compiler (drive down the middle of the road and don't use those parts of the language).
Bottom line, from the nature of your questions, I would recommend writing a bunch of small functions, or programs with some small functions, and compiling to assembler, or compiling to an object and disassembling, to see what the compiler does. Be sure to use different optimization settings on each program. Gain a working reading knowledge of the instruction set (granted, the asm output of the compiler or disassembler has a lot of extra fluff that gets in the way of readability; you have to look past that, and you need almost none of it if you want to write assembler). Give yourself 5-20 years of studying and experimenting before you can expect to outperform the compiler on a regular basis, if that is your goal. By then you will learn that, particularly with this chip family, it is a futile effort: you win a few but mostly lose. It would be to your benefit to compile (to assembler) the same code for other chip families like ARM and MIPS, get a general feel for what C code compiles well in general and what C code doesn't, and make your C programming better instead of trying to make the assembler better. Also try other compilers like llvm. GCC has a lot of quirks that many think are part of the C language standards but are instead nuances or problems of the specific compiler. Being able to read and analyze the assembly output of the compilers and their options will provide this knowledge. So I recommend you work on a reading knowledge of the instruction set, without necessarily having to learn to write it from scratch.

You need to look at it from the hardware's point of view: the assembly language is created with regard to what the CPU can do. Every time a new feature is added to a CPU, an appropriate assembly instruction is created so that it can be used.
Assembly is thus very dependent on the CPU. High-level languages like C++ provide abstractions over this, so that we do not have to think about details like CPU instructions, and the compiler generates optimized assembly code for us.
EDIT:
How many assembly languages are there? How many work well with other languages?
As many as there are different types of CPU. I didn't understand the second question: assembly per se does not interact with any other language; its output, the machine code, does.
How would someone go about writing a routine in assembly, and then compiling it in to object/binary code?
The principle is similar to writing in any other compiled language: you create a text file with the assembly instructions and use an assembler to translate it to machine code. Then you link it with any runtime libraries it needs.
How would someone then reference the functions/routines within that assembly code from a language like C or C++?
C++ and C provide inline assembly so there is no need to link, but if you want to link you need to create the assembly object following the same/similar calling conventions as the host language. For instance some languages when calling a function push the arguments to the function on the stack in a certain order, so you would have to do the same.
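To show what "following the same calling convention" means in practice, here is a sketch of a routine written entirely in assembly and called from C. To keep it in one file, the assembly is placed in a top-level asm block rather than a separate .s file; that layout, and the name asm_add, are choices made for this example. It assumes GCC or Clang on x86-64 Linux (System V convention, AT&T syntax) and falls back to plain C elsewhere.

```c
/* The C side only needs a declaration that matches the convention
   the assembly routine follows. */
long asm_add(long a, long b);

#if (defined(__GNUC__) || defined(__clang__)) && \
    defined(__x86_64__) && defined(__linux__)
/* System V x86-64: the two integer arguments arrive in rdi and rsi,
   and the result is returned in rax. */
__asm__(
    ".pushsection .text\n"
    ".globl asm_add\n"
    ".type asm_add, @function\n"
    "asm_add:\n"
    "    leaq (%rdi,%rsi), %rax\n"
    "    ret\n"
    ".size asm_add, .-asm_add\n"
    ".popsection\n"
);
#else
/* Portable stand-in for other platforms. */
long asm_add(long a, long b) { return a + b; }
#endif
```

If the routine read its arguments from the wrong registers, or clobbered a register the convention says must be preserved, the C caller would silently get wrong results, which is why the convention has to match.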
How do we know the code we've written in assembly is the fastest it possibly can be?
Because it is closest to the actual hardware. When you are dealing with higher-level languages you don't know what the compiler will do with your for loop. However, more often than not compilers do a better job of optimizing the code than a human can (of course, in very special circumstances you can probably get a better result).

There are many many different assembly languages out there. Usually there is at least one for every processor instruction set, which means one for every processor type. One thing that you should also keep in mind is that even for a single processor there may be several different assembler programs that may use a different syntax, which from a formal view constitutes a different language. (for x86 there are masm, nasm, yasm, AT&T (what *nix assemblers like the GNU assembler use by default), and probably many more)
For x86 there are lots of different instruction sets because there have been so many changes to the architecture over the years. Some of these changes could be viewed mostly as additional instructions, so they are a super set of the previous assembly. Other changes may actually remove instructions (none are coming to mind for x86, but I've heard of some on other processors). And other changes add modes of operation to processors that make things even more complicated.
There are also other processors with completely different instructions.
To learn assembly you will need to start by picking a target processor and an assembler that you want to use. I'm going to assume that you are going to use x86, so you would need to decide if you want to start with 16 bit segmented, 32 bit, or 64 bit. Many books and online tutorials go the 16 bit route where you write DOS programs. If you are wanting to write parts of C programs in assembly then you will probably want to go the 32 or 64 bit route.
Most of the assembly programming I do is inline in C to either optimize something, to make use of instructions that the compiler doesn't know about, or when I otherwise need to control the instructions used. Writing large amounts of code in assembly is difficult, so I let the C compiler do most of the work.
There are lots of places where assembly is still written by people. This is particularly common in embedded code, boot loaders (BIOS, U-Boot, ...), and operating system code, though many developers in these areas never directly write any assembly. Such code may be startup code that has to run before the stack pointer is set to a usable value (or before RAM is usable for some other reason), code that needs to fit within a small space, and/or code that needs to talk to hardware in ways that aren't directly supported in C or other higher level languages. Other places where assembly is used in OSes are locks (spinlocks, critical sections, mutexes, and semaphores) and context switching (switching from one thread of execution to another).
Another place where assembly is commonly written is in the implementation of some library code. Functions like strcpy are often implemented in assembly for different architectures because there are often several ways that they may be optimized using processor specific operations, while a C implementation might use a more general loop. These functions are also reused so often that optimizing them by hand is often worth the effort in the long run.
Another, related, place where lots of assembly is written is within compilers. Compilers have to know how to implement things and many of them produce assembly, so they have assembly templates (or something similar) built into them for use in generating output code.
Even if you never write any assembly, knowing the instructions and registers of your target system is often useful. It can aid in debugging, but it can also aid in writing code. Knowing the target processor can help you write better (smaller and/or faster) code for it (even in a higher level language), and being familiar with a few different processors will help you to write code that will be good for many processors because you will know generally how CPUs work.

We do a fair bit of it in our Real-Time work (more than we should really). A wee bit of assembly can also be quite useful when you are talking to hardware, and need specific machine instructions executed (eg: All writes must be 16-bit writes, or you'll hose nearby registers).
What I tend to see today is assembly insertions in higher-level language code. How exactly this is done depends on your language and sometimes compiler.

I know processors have different
instruction sets above the basic x86
instruction set. Do all assembly
languages support all instruction
sets?
"Assembly language" is a kind of misnomer, at least in the way you are using it. Assemblers are less of a language (CS graduates may object) and more of a converter tool which takes textual representation and generates a binary image from it, with a close to 1:1 relationship between text elements (memnonics, labels and numbers) and binary elements. There is no deeper logic behind the elements of an assembler language because their possibilities to be quoted and redirected ends mostly at level 1; you can, for example, use EAX only in one instruction at a time - the next use of EAX in the next instruction bears no relationship with its previous use EXCEPT for the unwritten logical connection which the programmer had in mind - this is the reason why it is so easy to create bugs in assembler.
How would someone go about writing a
routine in assembly, and then
compiling it in to object/binary code?
One would need to pin down the lowest common denominator of instruction sets and code the function once for each architecture the code is intended to run on. In other words, unless you are coding for a hardware platform that is fixed at the time of writing (e.g. a game console or an embedded board), you no longer do this.
How would someone then reference the
functions/routines within that
assembly code from a language like C
or C++?
You need to declare them in your HLL; see your compiler's handbook.
How do we know the code we've written
in assembly is the fastest it possibly
can be?
There is no way to know. Be happy about that and code on.

Taking advantage of SSE and other CPU extensions

There are a couple of places in my code base where the same operation is repeated a very large number of times for a large data set. In some cases it's taking a considerable time to process these.
I believe that using SSE to implement these loops should improve their performance significantly, especially where many operations are carried out on the same set of data, so once the data is read into the cache initially, there shouldn't be any cache misses to stall it. However I'm not sure about going about this.
Is there a compiler and OS independent way of writing the code to take advantage of SSE instructions? I like the VC++ intrinsics, which include SSE operations, but I haven't found any cross compiler solutions.
I still need to support some CPUs that either have no or limited SSE support (e.g. Intel Celeron). Is there some way to avoid having to make different versions of the program, like having some kind of "run time linker" that links in either the basic or SSE optimised code based on the CPU running it when the process is started?
What about other CPU extensions? Looking at the instruction sets of various Intel and AMD CPUs shows there are a few of them.

For your second point there are several solutions as long as you can separate out the differences into different functions:
plain old C function pointers
dynamic linking (which generally relies on C function pointers)
if you're using C++, having different classes that represent the support for different architectures and using virtual functions can help immensely with this.
Note that because you'd be relying on indirect function calls, the functions that abstract the different operations generally need to represent somewhat higher level functionality or you may lose whatever gains you get from the optimized instruction in the call overhead (in other words don't abstract the individual SSE operations - abstract the work you're doing).
Here's an example using function pointers:
typedef int (*scale_func_ptr)( int scalar, int* pData, int count);

int non_sse_scale( int scalar, int* pData, int count)
{
    // do whatever work needs done, without SSE so it'll work on older CPUs
    return 0;
}

int sse_scale( int scalar, int* pData, int count)
{
    // equivalent code, but uses SSE
    return 0;
}

// at initialization (useSSE holds the result of a CPU feature check)
scale_func_ptr scale_func = non_sse_scale;
if (useSSE) {
    scale_func = sse_scale;
}

// now, when you want to do the work:
scale_func( 12, theData_ptr, 512); // this will call the routine tailored to SSE
                                   // if the CPU supports it, otherwise calls the non-SSE
                                   // version of the function
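The useSSE flag above has to come from somewhere. One way to fill it in, sketched here under a few assumptions, is to ask the CPU at startup: GCC and Clang expose __builtin_cpu_supports on x86, and MSVC exposes __cpuid in <intrin.h> (SSE2 is reported in bit 26 of EDX for leaf 1). The names cpu_has_sse2 and select_scale_func are made up for this example, and sse_scale is only a stand-in that runs the same scalar loop, since the real SSE body is not part of the original answer.

```cpp
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#endif

typedef int (*scale_func_ptr)(int scalar, int* pData, int count);

// Portable fallback: multiplies every element by scalar.
int non_sse_scale(int scalar, int* pData, int count) {
    for (int i = 0; i < count; ++i) pData[i] *= scalar;
    return 0;
}

// Stand-in for the SSE version. In a real program this would live in a
// separate file compiled with SSE enabled; here it is the same loop so
// that the sketch runs anywhere.
int sse_scale(int scalar, int* pData, int count) {
    for (int i = 0; i < count; ++i) pData[i] *= scalar;
    return 0;
}

// Hypothetical helper: true if the CPU reports SSE2 support.
bool cpu_has_sse2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse2") != 0;
#elif defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int info[4] = {0, 0, 0, 0};
    __cpuid(info, 1);                     // leaf 1: feature flags
    return (info[3] & (1 << 26)) != 0;    // EDX bit 26 = SSE2
#else
    return false;                         // unknown platform: stay on the safe path
#endif
}

// Pick the implementation once, at initialization.
scale_func_ptr select_scale_func() {
    return cpu_has_sse2() ? sse_scale : non_sse_scale;
}
```

Because the choice is made once and stored in a pointer, the per-call cost is a single indirect call, which is why the work behind the pointer should be a whole loop rather than a single operation.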

Good reading on the subject: Stop the instruction set war
Short overview: Sorry, it is not possible to solve your problem in a simple and fully compatible (Intel vs. AMD) way.

The SSE intrinsics work with Visual C++, GCC and the Intel compiler. There is no problem using them these days.
Note that you should always keep a version of your code that does not use SSE and constantly check it against your SSE implementation.
This helps not only with debugging; it is also useful if you want to support CPUs or architectures that don't support your required SSE versions.

In answer to your comment:
So effectively, as long as I don't try to actually execute code containing unsupported instructions I'm fine, and I could get away with an "if(sse2Supported){...}else{...}" type switch?
Depends. It's fine for SSE instructions to exist in the binary as long as they're not executed. The CPU has no problem with that.
However, if you enable SSE support in the compiler, it will most likely swap a number of "normal" instructions for their SSE equivalents (scalar floating-point ops, for example), so even chunks of your regular non-SSE code will blow up on a CPU that doesn't support it.
So what you'll most likely have to do is compile one or two files separately, with SSE enabled, and let them contain all your SSE routines. Then link that with the rest of the app, which is compiled without SSE support.

Rather than hand-coding an alternative SSE implementation to your scalar code, I strongly suggest you have a look at OpenCL. It is a vendor-neutral portable, cross-platform system for computationally intensive applications (and is highly buzzword-compliant!). You can write your algorithm in a subset of C99 designed for vectorised operations, which is much easier than hand-coding SSE. And best of all, OpenCL will generate the best implementation at runtime, to execute either on the GPU or on the CPU. So basically you get the SSE code written for you.
There are a couple of places in my code base where the same operation is repeated a very large number of times for a large data set. In some cases it's taking a considerable time to process these.
Your application sounds like just the kind of problem that OpenCL is designed to address. Writing alternative functions in SSE would certainly improve the execution speed, but it is a great deal of work to write and debug.
Is there a compiler and OS independent way of writing the code to take advantage of SSE instructions? I like the VC++ intrinsics, which include SSE operations, but I haven't found any cross compiler solutions.
Yes. The SSE intrinsics have been essentially standardised by Intel, so the same functions work the same between Windows, Linux and Mac (specifically with Visual C++ and GNU g++).
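To make that concrete, here is a small sketch of a loop written with the SSE intrinsics from <xmmintrin.h>, which should build unchanged with Visual C++, GCC and Clang on x86. The function names and the EXAMPLE_HAS_SSE macro are made up for this example, and the preprocessor check is only there so the same file still compiles (using the scalar loop) when SSE is not enabled or the target is not x86.

```cpp
#include <cstddef>

// Use the intrinsics only when the compiler is targeting SSE.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define EXAMPLE_HAS_SSE 1
#else
#define EXAMPLE_HAS_SSE 0
#endif

// Plain scalar version: multiplies each element by factor.
void scale_scalar(float factor, float* data, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) data[i] *= factor;
}

// SSE version: processes four floats per iteration, then hands the
// remaining 0-3 elements to the scalar loop.
void scale_sse(float factor, float* data, std::size_t count) {
#if EXAMPLE_HAS_SSE
    const __m128 f = _mm_set1_ps(factor);       // factor in all four lanes
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);      // unaligned load of 4 floats
        _mm_storeu_ps(data + i, _mm_mul_ps(v, f));
    }
    scale_scalar(factor, data + i, count - i);  // tail elements
#else
    scale_scalar(factor, data, count);          // no SSE at compile time
#endif
}
```

The unaligned load and store variants are used so the sketch makes no assumption about how the caller allocated the buffer; aligned variants exist but require 16-byte aligned data.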
I still need to support some CPUs that either have no or limited SSE support (e.g. Intel Celeron). Is there some way to avoid having to make different versions of the program, like having some kind of "run time linker" that links in either the basic or SSE optimised code based on the CPU running it when the process is started?
You could do that (eg. using dlopen()) but it is a very complex solution. Much simpler would be (in C) to define a function interface and call the appropriate version of the optimised function via function pointer, or in C++ to use different implementation classes, depending on the CPU detected.
With OpenCL it is not necessary to do this, as the code is generated at runtime for the given architecture.
What about other CPU extensions? Looking at the instruction sets of various Intel and AMD CPUs shows there are a few of them.
Within the SSE instruction set, there are many flavours. It can be quite difficult to code the same algorithm in different subsets of SSE when certain instructions are not present. I suggest (at least to begin with) that you choose a minimum supported level, such as SSE2, and fall back to the scalar implementation on older machines.
This is also an ideal situation for unit/regression testing, which is very important to ensure your different implementations produce the same results. Have a test suite of input data and known good output data, and run the same data through both versions of the processing function. You may need a precision tolerance for passing (i.e. the difference epsilon between the result and the correct answer is below 1e-6, for example). This will greatly aid in debugging, and if you build high-resolution timing into your testing framework, you can compare the performance improvements at the same time.
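A bare-bones version of such a check might look like this. All names here (scale_reference, scale_optimised, within_epsilon, implementations_agree) are invented for the sketch, and scale_optimised is just an unrolled loop standing in for a real SSE routine. The tolerance is an absolute one; for data with large magnitudes a relative tolerance would probably be the better choice.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference implementation: the simple version you trust.
std::vector<float> scale_reference(float factor, const std::vector<float>& in) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] * factor;
    return out;
}

// Stand-in for the optimised routine: four elements per iteration,
// followed by a tail loop for the remainder.
std::vector<float> scale_optimised(float factor, const std::vector<float>& in) {
    std::vector<float> out(in.size());
    std::size_t i = 0;
    for (; i + 4 <= in.size(); i += 4) {
        out[i]     = in[i]     * factor;
        out[i + 1] = in[i + 1] * factor;
        out[i + 2] = in[i + 2] * factor;
        out[i + 3] = in[i + 3] * factor;
    }
    for (; i < in.size(); ++i) out[i] = in[i] * factor;
    return out;
}

// True if both vectors have the same length and every pair of elements
// differs by at most epsilon (absolute difference).
bool within_epsilon(const std::vector<float>& a, const std::vector<float>& b,
                    float epsilon) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (std::fabs(a[i] - b[i]) > epsilon) return false;
    }
    return true;
}

// Runs the same input through both versions and compares the results.
bool implementations_agree(float factor, const std::vector<float>& input,
                           float epsilon = 1e-6f) {
    return within_epsilon(scale_reference(factor, input),
                          scale_optimised(factor, input), epsilon);
}
```

Feeding implementations_agree a set of representative inputs (including sizes that are not a multiple of four) from your test suite catches most tail-handling mistakes early.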
