I have been reading about heterogeneous computing and came across SPIR-V. There I found the following:
SPIR-V is the first open standard, cross-API intermediate language for natively representing parallel compute and graphics.
From this image I can see that all the high-level languages such as GLSL, HLSL, OpenCL C, etc., are compiled into SPIR-V and in this way passed to the correct physical device to be executed.
My question is: why do we need to compile our shader/kernel code to SPIR-V instead of compiling it directly into machine instructions that will be executed by the chosen physical device? In case this question is not correct, can you please explain why we need SPIR-V at all?
In general you can split a compiler into two parts: a front-end for a specific language (or family of languages), and a back-end, which is language-agnostic and can generate machine code for one or more specific architectures (you can break this down further, but that's enough for now). Optimizations can happen in both parts; some are more appropriate in either place. This is the relationship between clang and LLVM for example: clang is a front-end for C-family languages, and LLVM is the backend.
Because different GPUs have significantly different machine code (often much more different than, say, arm64 vs. x86_64), the backend compiler needs to be in the GPU driver. But there's no reason to have the front-end be there too, even though that's how it worked in OpenGL. By separating the two, and using SPIR-V as the language they use to communicate, we get:
One parsing and syntax checking implementation, instead of one per vendor. This means developers get to target just one variant of the language, instead of a bunch of vendor-specific variants (due to implementing different versions, bugs, differences in interpretation, etc.)
Support for multiple languages. You can use ESSL (OpenGL ES's variant of GLSL), GLSL, HLSL, and OpenCL-C to write Vulkan shaders, making it easier for developers to support multiple APIs. All emit SPIR-V, so drivers don't have to support each of these languages. In theory someone could design their own language, or support MetalSL, etc.
Since SPIR-V is meant to be machine-written / machine-read instead of human-friendly, it is a simpler and more regular language than GLSL. So it should be easier to get all vendors to implement it with high quality. (At the moment, implementations are a lot less mature than in GL drivers, so we're not quite there yet.)
Some expensive optimizations can be done offline, e.g. as part of the app build process, instead of at runtime when you're trying to finish a frame in 16 or 33 milliseconds.
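To make the front-end / back-end split concrete, here is a minimal sketch (assuming a valid VkDevice and a vector of SPIR-V words produced offline, e.g. with glslangValidator) of the hand-off: the application gives the driver SPIR-V it compiled earlier, and only the back-end runs at this point.

    // Minimal sketch: handing precompiled SPIR-V to a Vulkan driver.
    // `device` and `spirvWords` are assumed inputs; the words come from an
    // offline front-end compiler such as glslangValidator.
    #include <vulkan/vulkan.h>
    #include <cstdint>
    #include <vector>

    VkShaderModule createShaderModule(VkDevice device,
                                      const std::vector<uint32_t>& spirvWords) {
        VkShaderModuleCreateInfo info{};
        info.sType    = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
        info.codeSize = spirvWords.size() * sizeof(uint32_t); // size in bytes
        info.pCode    = spirvWords.data();                    // SPIR-V words

        VkShaderModule module = VK_NULL_HANDLE;
        // The driver's back-end turns this SPIR-V into GPU machine code,
        // typically when the pipeline is created.
        vkCreateShaderModule(device, &info, nullptr, &module);
        return module;
    }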
Apologies for the slightly jokey title, but I couldn't find another way to concisely describe the question. I work in a team that uses predominantly OpenCL code with a CPU fallback. For the most part this works fine, except when it comes to Nvidia and their refusal to use SPIR-V for OpenCL.
I recently found and have been looking into SYCL, but the ecosystem surrounding it is more than a little bit confusing, and in one case I found one implementation referring to using another implementation.
So my question is: is there a single SYCL implementation that can produce a single binary that has runtime support for Nvidia, AMD and Intel (preferred, but not required) and either x64 or Arm64 (we would create a second binary for the other one), without having to do what we do now, which is to select a bunch of GPUs from the various vendors, build the kernels for each one separately, and then have to ship them all?
Thanks
As of December 2022, for Linux and x86_64:
The open-source version of DPC++ can compile code for all three GPU vendors. In my experience, a single binary for all three vendors works.
hipSYCL has official support for NVIDIA and AMD devices, and experimental support for Intel GPUs (via the above-mentioned DPC++).
without having to do what we do now, which is to select a bunch of GPUs from the various vendors, build the kernels for each one separately, and then have to ship them all.
Note: Under the hood, both hipSYCL and DPC++ work this way. The kernels are compiled to PTX, GCN, and/or SPIR-V. They are bundled into a single binary, though, so, in this respect, the distribution can be simpler (or not: you will likely have to also ship the SYCL runtime libraries with your application).
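For a feel of what "single source, several back-ends" looks like, here is a minimal SYCL sketch of a vector add (on older toolchains the header may be <CL/sycl.hpp>, and which GPU back-ends are actually available depends on how your DPC++ or hipSYCL install was built):

    // Minimal single-source SYCL sketch: one binary, device picked at runtime.
    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
        std::vector<float> a(1024, 1.0f), b(1024, 2.0f), c(1024, 0.0f);
        {
            sycl::queue q;  // default selector: NVIDIA, AMD or Intel GPU, or CPU fallback
            sycl::buffer<float, 1> bufA{a.data(), sycl::range<1>{a.size()}};
            sycl::buffer<float, 1> bufB{b.data(), sycl::range<1>{b.size()}};
            sycl::buffer<float, 1> bufC{c.data(), sycl::range<1>{c.size()}};

            q.submit([&](sycl::handler& h) {
                auto A = bufA.get_access<sycl::access::mode::read>(h);
                auto B = bufB.get_access<sycl::access::mode::read>(h);
                auto C = bufC.get_access<sycl::access::mode::write>(h);
                h.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) {
                    C[i] = A[i] + B[i];
                });
            });
        }   // buffers are destroyed here and copy the results back into c
        return 0;
    }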
While AMD follows the OpenGL specification very strictly, nVidia often works even when the specification is not followed. One example is that nVidia supports element indices (used in glDrawElements) in CPU memory, whereas AMD only supports element indices from an element array buffer.
My question is: Is there a way to enforce strict OpenGL behaviour using a nVidia driver? Currently I'm interested in a solution for a Windows/OpenGL 3.2/FreeGlut/GLEW setup.
Edit: If it is not possible to enforce strict behaviour on the driver itself - is there some OpenGL proxy that guarantees strict behaviour (such as GLIntercept)?
No vendor enforces the specification strictly. Be it AMD, nVidia, Intel, PowerVR, ... they all have their idiosyncrasies and you have to learn to live with them, sadly. That is one of the annoying things about having each vendor implement their own GLSL compiler, as opposed to Microsoft implementing the one and only HLSL compiler in D3D.
The ANGLE project tries to mitigate this to a certain extent by providing a single shader validator shared across many of the major web browsers, but it is an uphill battle and this only applies to WebGL for the most part. You will always have implementation differences when every vendor implements the entire API themselves.
Now that the Khronos Group has seriously taken on the task of establishing a set of conformance tests for desktop OpenGL, like they have for WebGL / OpenGL ES, things might start to get a little bit better. But forcing a driver to operate in a strict conformance mode is not really a standard thing - there may be #pragmas and such that hint the compiler to behave more strictly, but these are all vendor specific.
By the way, I realize this question has nothing to do with GLSL per-se, but it was the best example I could give.
Unfortunately, the only way you can be sure that your OpenGL code will work on your target hardware is to test it. In theory simply writing standard compliant code should work everywhere, but sadly this isn't always the case.
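For the specific glDrawElements example from the question, the portable, core-profile-conformant pattern is to source the indices from a GL_ELEMENT_ARRAY_BUFFER rather than from client memory. A minimal sketch (assuming a current 3.2 core context; indices and indexCount stand in for your own data):

    // Sketch: spec-conformant indexed draw for a core profile.
    // `indices` (a GLuint array) and `indexCount` are placeholders.
    GLuint ebo;
    glGenBuffers(1, &ebo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ebo);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER,
                 indexCount * sizeof(GLuint), indices, GL_STATIC_DRAW);

    // The last argument is a byte offset into the bound element buffer,
    // not a CPU pointer; nVidia may accept a raw pointer here, AMD won't.
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, (const void*)0);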
Can anyone give me a good explanation as to the nature of CUDA C and C++? As I understand it, CUDA is supposed to be C with NVIDIA's GPU libraries. As of right now CUDA C supports some C++ features but not others.
What is NVIDIA's plan? Are they going to build upon C and add their own libraries (e.g. Thrust vs. STL) that parallel those of C++? Are they eventually going to support all of C++? Is it bad to use C++ headers in a .cu file?
CUDA C is a programming language with C syntax. Conceptually it is quite different from C.
The problem it is trying to solve is coding multiple (similar) instruction streams for multiple processors.
CUDA offers more than Single Instruction Multiple Data (SIMD) vector processing, but it needs data streams to greatly outnumber instruction streams (data streams >> instruction streams), or there is much less benefit.
CUDA gives some mechanisms to do that, and hides some of the complexity.
CUDA is not optimised for multiple diverse instruction streams like a multi-core x86.
CUDA is not limited to a single instruction stream like x86 vector instructions, nor limited to the specific data types that x86 vector instructions support.
CUDA supports 'loops' which can be executed in parallel. This is its most critical feature. The CUDA system will partition the execution of 'loops', and run the 'loop' body simultaneously across an array of identical processors, while providing some of the illusion of a normal sequential loop (specifically CUDA manages the loop "index"). The developer needs to be aware of the GPU machine structure to write 'loops' effectively, but almost all of the management is handled by the CUDA run-time. The effect is hundreds (or even thousands) of 'loops' complete in the same time as one 'loop'.
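A minimal CUDA sketch of that 'loop' idea (the names are only illustrative): the loop body becomes a kernel, and the loop "index" is reconstructed from coordinates the CUDA runtime manages.

    // The body of `for (i = 0; i < n; ++i) c[i] = a[i] + b[i];` as a kernel.
    __global__ void addArrays(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's "loop index"
        if (i < n)                                      // guard: a few extra threads may be launched
            c[i] = a[i] + b[i];
    }

    // Host side: enough 256-thread blocks to cover n elements run "at once".
    // addArrays<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);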
CUDA supports what looks like if branches. Only processors running code which match the if test can be active, so a subset of processors will be active for each 'branch' of the if test. As an example this if... else if ... else ..., has three branches. Each processor will execute only one branch, and be 're-synched' ready to move on with the rest of the processors when the if is complete. It may be that some of the branch conditions are not matched by any processor. So there is no need to execute that branch (for that example, three branches is the worst case). Then only one or two branches are executed sequentially, completing the whole if more quickly.
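A small sketch of that branching behaviour (illustrative names again): each thread takes exactly one branch, the hardware masks off the others, and the threads re-converge afterwards.

    __global__ void classify(const int* x, int* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if      (x[i] > 0) y[i] =  1;   // only threads whose data matches run this
        else if (x[i] < 0) y[i] = -1;   // then these run, with the others masked off
        else               y[i] =  0;   // then these; afterwards the threads re-sync
    }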
There is no 'magic'. The programmer must be aware that the code will be run on a CUDA device, and write code consciously for it.
CUDA does not take old C/C++ code and auto-magically run the computation across an array of processors. CUDA can compile and run ordinary C and much of C++ sequentially, but there is very little (nothing?) to be gained by that because it will run sequentially, and more slowly than a modern CPU. This means the code in some libraries is not (yet) a good match with CUDA capabilities. A CUDA program could operate on multi-kByte bit-vectors simultaneously. CUDA isn't able to auto-magically convert existing sequential C/C++ library code into something which would do that.
CUDA does provide a relatively straightforward way to write code using familiar C/C++ syntax; it adds a few extra concepts and generates code that will run across an array of processors. It has the potential to give much more than a 10x speedup vs e.g. multi-core x86.
Edit - Plans: I do not work for NVIDIA
For the very best performance CUDA wants information at compile time.
So template mechanisms are the most useful because they give the developer a way to say things at compile time which the CUDA compiler can use. As a simple example, if a matrix is defined (instantiated) at compile time to be 2D and 4 x 8, then the CUDA compiler can work with that to organise the program across the processors. If that size is dynamic, and changes while the program is running, it is much harder for the compiler or run-time system to do a very efficient job.
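A hedged sketch of that idea (illustrative names; nothing beyond ordinary templates is assumed): baking the 4 x 8 shape in as template parameters lets the compiler size on-chip storage statically and specialise the generated code for exactly that shape.

    // Shape fixed at compile time via template parameters.
    template <int ROWS, int COLS>
    __global__ void scaleMatrix(float* m, float s) {
        __shared__ float tile[ROWS][COLS];      // statically sized shared memory
        int r = threadIdx.y, c = threadIdx.x;
        if (r < ROWS && c < COLS) {
            tile[r][c] = m[r * COLS + c] * s;
            m[r * COLS + c] = tile[r][c];
        }
    }

    // Instantiated for the 4 x 8 example above:
    // scaleMatrix<4, 8><<<1, dim3(8, 4)>>>(d_matrix, 2.0f);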
EDIT:
CUDA has class and function templates.
I apologise if people read this as saying CUDA does not. I agree I was not clear.
I believe the CUDA GPU-side implementation of templates is not complete w.r.t. C++.
User harrism has commented that my answer is misleading. harrism works for NVIDIA, so I will wait for advice. Hopefully this is already clearer.
The hardest stuff to do efficiently across multiple processors is dynamic branching down many alternate paths because that effectively serialises the code; in the worst case only one processor can execute at a time, which wastes the benefit of a GPU. So virtual functions seem to be very hard to do well.
There are some very smart whole-program-analysis tools which can deduce much more type information than the developer might understand. Existing tools might deduce enough to eliminate virtual functions, and hence move analysis of branching to compile time. There are also techniques for instrumenting program execution which feeds directly back into recompilation of programs which might reach better branching decisions.
AFAIK (modulo feedback) the CUDA compiler is not yet state-of-the-art in these areas.
(IMHO it is worth a few days for anyone interested, with a CUDA or OpenCL-capable system, to investigate them, and do some experiments. I also think, for people interested in these areas, it is well worth the effort to experiment with Haskell, and have a look at Data Parallel Haskell)
CUDA is a platform (architecture, programming model, assembly virtual machine, compilation tools, etc.), not just a single programming language. CUDA C is just one of a number of language systems built on this platform (CUDA C++, CUDA Fortran, and PyCUDA are others).
CUDA C++
Currently CUDA C++ supports the subset of C++ described in Appendix D ("C/C++ Language Support") of the CUDA C Programming Guide.
To name a few:
Classes
__device__ member functions (including constructors and destructors)
Inheritance / derived classes
virtual functions
class and function templates
operators and overloading
functor classes
Edit: As of CUDA 7.0, CUDA C++ includes support for most language features of the C++11 standard in __device__ code (code that runs on the GPU), including auto, lambda expressions, range-based for loops, initializer lists, static assert, and more.
Examples and specific limitations are also detailed in the same appendix linked above. As a very mature example of C++ usage with CUDA, I recommend checking out Thrust.
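As a small taste of that C++ style, here is a hedged Thrust sketch (Thrust ships with the CUDA toolkit; the functor name is illustrative) combining a functor class, templates and a __device__-callable member function:

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    struct saxpy {                         // functor class
        float a;
        __host__ __device__ float operator()(float x, float y) const {
            return a * x + y;              // callable from device code
        }
    };

    void run(const thrust::device_vector<float>& x,
             thrust::device_vector<float>& y, float a) {
        // Template algorithm call; Thrust generates and launches the kernel.
        thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy{a});
    }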
Future Plans
(Disclosure: I work for NVIDIA.)
I can't be explicit about future releases and timing, but I can illustrate the trend that almost every release of CUDA has added additional language features to get CUDA C++ support to its current (In my opinion very useful) state. We plan to continue this trend in improving support for C++, but naturally we prioritize features that are useful and performant on a massively parallel computational architecture (GPU).
Something not realized by many is that CUDA is actually two new programming languages, both derived from C++. One is for writing code that runs on GPUs and is a subset of C++. Its function is similar to HLSL (DirectX) or Cg (OpenGL), but with more features and compatibility with C++. Various GPGPU/SIMT/performance-related concerns apply to it that I need not mention. The other is the so-called "Runtime API," which is hardly an "API" in the traditional sense. The Runtime API is used to write code that runs on the host CPU. It is a superset of C++ and makes it much easier to link to and launch GPU code. It requires the NVCC pre-compiler, which then calls the platform's C++ compiler. By contrast, the Driver API (and OpenCL) is a pure, standard C library, and is much more verbose to use (while offering few additional features).
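To illustrate the "superset of C++" point, a minimal Runtime API sketch (illustrative names): the <<<...>>> launch is one of the constructs nvcc rewrites before handing the remaining host code to the platform's C++ compiler, whereas the Driver API needs several calls (cuModuleLoad, cuLaunchKernel, ...) to do the same thing.

    #include <cuda_runtime.h>

    __global__ void fill(float* p, float v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] = v;
    }

    int main() {
        const int n = 1 << 20;
        float* d_p = nullptr;
        cudaMalloc(&d_p, n * sizeof(float));
        fill<<<(n + 255) / 256, 256>>>(d_p, 1.0f, n);  // host code launching device code
        cudaDeviceSynchronize();
        cudaFree(d_p);
        return 0;
    }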
Creating a new host-side programming language was a bold move on NVIDIA's part. It makes getting started with CUDA easier and writing code more elegant. However, what was truly brilliant was not marketing it as a new language.
Sometimes you hear that CUDA would be C and C++, but I don't think it is, for the simple reason that this is impossible. To cite from their programming guide:
For the host code, nvcc supports whatever part of the C++ ISO/IEC 14882:2003 specification the host c++ compiler supports.
For the device code, nvcc supports the features illustrated in Section D.1 with some restrictions described in Section D.2; it does not support run time type information (RTTI), exception handling, and the C++ Standard Library.
As far as I can see, it only refers to C++, and only supports C where it happens to be in the intersection of C and C++. So better think of it as C++ with extensions for the device part rather than C. That saves you a lot of headaches if you are used to C.
What is NVIDIA's plan?
I believe the general trend is that CUDA and OpenCL are regarded as too low-level for many applications. Right now, Nvidia is investing heavily in OpenACC, which could roughly be described as OpenMP for GPUs. It follows a declarative approach and tackles the problem of GPU parallelization at a much higher level. So that is my totally subjective impression of what Nvidia's plan is.
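For a flavour of that declarative approach, a minimal OpenACC sketch (requires an OpenACC-capable compiler such as NVIDIA's; the function name is illustrative):

    // The pragma asks the compiler to offload and parallelize the loop;
    // compilers without OpenACC support simply ignore it.
    void saxpy(int n, float a, const float* x, float* y) {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }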
I've asked before what language I should learn for embedded development. Most embedded engineers said C and C++ are a must, but also pointed out that it depends on the chip.
Can someone clarify? Is it a compiler issue or what? Do chips come with their own specific compilers (like a c compiler or c++ compiler) and that's why you have to use the language the compiler knows? Is it not possible to code and compile it elsewhere, then burn it to the chip directly in its compiled state? (I think I heard an acquaintance say something to this effect)
I'm not sure how this works, as clearly I don't know much about embedded systems or how they work. It's probably an easy answer for those of you who know.
Probably, they meant some toolchains do not support C++. Yes, many chips and boards do come with their own toolchains. Different processors have different instruction sets, which means a different compiler (or more specifically a different backend). That doesn't mean you always have to relearn everything. Many of these are based on GCC (often considered the most ported compiler). The final executable/image formats also vary, so you need a specific linker. Most likely, you will be (cross-)compiling the code for the chip on a "regular" computer, then burning it to the chip. However, that doesn't mean you can use a typical compiler and linker targeted towards a desktop operating system.
It "depends on the chip" in three possible ways:
Some very constrained architectures are not suited to C++, or at least C++ provides constructs not suited to such architectures and so offers no benefit over C. Most 8-bit devices fall into this category, but by no means all; I have seen useful C++ code implemented on MegaAVR for example.
Some devices are not supported by a C++ compiler. For example Microchip's dsPIC/PIC24 compiler is C only (third-party tools may have C++ support).
The chip architecture is designed specifically for a particular language; for example INMOS Transputers invariably ran OCCAM.
As well as C and C++, other possibilities are assembler, Forth, Ada, Pascal and many others, but C is almost ubiquitous; few chip vendors will release a new architecture or device without a C compiler being available from day one. For other languages you will generally have to wait until a third party decides to develop one, and that wait may be forever for a niche architecture.
Is it not possible to code and compile it elsewhere, then burn it to the chip directly in its compiled state?
That is called cross-compilation or cross-development, and is the usual development method for embedded systems. Most embedded systems lack the OS, file, performance and memory resources to self-host a compiler, and most developers want the comfort of a sophisticated development environment with IDEs, debuggers etc. in a familiar user-oriented desktop OS.
I'm not sure how this works, as clearly I don't know much about embedded systems or how they work.
Get up-to-speed with some of these:
http://www.state-machine.com/arm/Building_bare-metal_ARM_with_GNU.pdf
http://www.eetimes.com/design/embedded
http://www.amazon.com/exec/obidos/ASIN/020179523X
http://www.amazon.com/Embedded-Systems-Firmware-Demystified-CD-ROM/dp/1578200997
Yes, there are many architectures for which a C compiler exists but a C++ compiler does not. The smaller and less fully-featured a processor you choose, the more likely this situation is to occur.
For embedded development, you almost always compile the code 'elsewhere', as you say, and then send it to the chip for execution/debugging. The process of compiling code for a different architecture than the compiler itself is built for is called 'cross-compiling'.
You are correct: chips have variations on compilers. Most/many modern chips have a gcc port; but not all.
The term 'embedded' is used to describe a vast range of hardware. Most embedded software engineering will consist of writing C/C++ code to produce a binary for a target microprocessor, but there are devices you may work with that are not programmed with a compiled binary.
One example is a Programmable Logic Controller (PLC). These devices use a language called "Ladder Logic". It's a wonderful language. I have enjoyed working with it in the past.
Another thing you may encounter, as I have in the past, is devices that have interpreted BASIC emulators. Hopefully that is rare today.
C/C++ are a very good choice for firmware development, so the software you make will run on an embedded CPU/microcontroller. In order to properly program the device, you will need to know the language and the device architecture.
The same code probably will not work on different devices. So, you have to learn the language and the device architecture.
Another option is FPGAs, which are not microcontrollers. FPGAs are devices with specialized cells capable of transforming themselves into any type of synchronous circuit, including a microcontroller. FPGAs are programmed with hardware description languages, like Verilog and VHDL. The "compiled" (synthesized) version of the design is called gateware.
The HDLs are the same languages used for ASIC design as well. The path to properly learning the language is long, so I recommend starting with C/C++ on a PIC from Microchip, which is a low-cost and widely accepted microcontroller.
If you intend to do FPGA development, the knowledge gained with C/C++/PIC will be helpful and important, because most FPGAs have an embedded CPU/microcontroller inside.
There is no direct scientific reason for it. In a lot of cases it has to do with the management and politics of the specific company.
Some companies are driven to create a turnkey system and force you to buy that system and pay for maintenance. It locks out the individual developers, but there are many companies and especially government agencies that prefer this model, because the support is often much better and you can often drive the direction of their products to suit your needs.
Other companies do not have the staff or the talent and outsource the solution, sometimes taking whatever they can get. You might end up with a one-time-developed tool that is never updated or fixed again after the contractor leaves, or if it is fixed it is a patch job by someone else. It takes money to make money, but if you run out of money before you can sell your product you still fail.
Sometimes you have companies that both maintain their in-house, must-buy-from-them tool AND have individuals who also contribute to open tools like gcc.
Sometimes the politics or management in the company have individuals that have a strong opinion of how the world must be and only allow tools to be developed for a specific language. Or perhaps they are owned by or partner with or just like a company that has a specific language and this chip product came to be simply to support that language.
On top of all of this you have the very real technical problems of memory space, the quality and efficiency of the instruction set and how compiler friendly it is. Some architectures may be fine for assembler, but higher level compiled code chews up the limited memory resources too quickly.
Gcc in particular has a lot of problems internally (not the people, but the software/source code itself). I challenge you to write a backend, even with the tutorials that are out there. A company requires specialised talent in order to create and then maintain a gcc backend year after year; otherwise the port gets dropped. If your chip architecture is not 32-bit or bigger, you are already fighting a losing battle with gcc; your chip architecture might be compiler-friendly, but just not friendly to the popular compilers' design.
In the near future llvm is going to shine as a cross compiler relative to gcc because it has not yet built up this internal bulk, and perhaps because its internal guts are themselves a defined language/system, it may never suffer what has happened to gcc. As more folks get comfortable with llvm we will see a number of architectures ported to it. The msp430 backend was done specifically to demonstrate that you can add a target literally in an afternoon. By the end of next month, some motivated individual could have all of the targets most of us have ever heard of ported to llvm. And you don't have to build a cross compiler; it is always a cross compiler. I only mention llvm because the door is now open for targets that have suffered from bad tools to recover.
Some companies, microcontroller vendors in particular, can and will make the programming interface proprietary so that you must use their programming tool (and/or hack it and take your chances with publishing those results, and/or play a cat-and-mouse game as they change it to defeat you). They may have only made tools for Windows, leaving the Linux and Apple folks hanging in the wind. Or they make it so that the only binaries it will load are the ones generated by their tools; here again you may hack through the binary format, allowing an alternate compiler, and they may or may not work to defeat you.
Despite the technical problems the biggest is the companies politics, management, marketing teams, and supply of or lack of talent in the engineering staff. The bottom line, follow the dollars not the technology or science to understand why this language is supported and not that, or the support for this language is good, bad, or marginal.
What language to learn as a result of all of this? Start with assembler on at least three different architectures. Then C, and then C++ if you feel you really need it. C and assembler are your primary languages for embedded (depending on your definition of embedded). That doesn't mean we write everything in assembler; we write assembler mostly for initial boot code and to support C: interrupt stuff or special instructions that are needed which the compiler cannot create. There are places like microcontrollers where it may very well make sense to use assembler for various reasons like tools, limited chip resources, etc. Even if you don't use assembler, knowing it makes you a much better high-level programmer.
You do need to decide what your definition of embedded is. Is it API and library calls for an application on a(n embedded) Linux system (indistinguishable from the same program/calls on a desktop system)? Or at the other end of the spectrum, are you talking about a microcontroller with maybe 256 or 1024 bytes (not mega or giga, but bytes) of program space? Or something in the middle? The majority of the "embedded" folks out there are closer to the API calls for applications on an operating system (rtos, linux, wince, etc.) than the deeply embedded, so that means C, maybe C++ (always be able to fall back on C), trying to avoid Python and other scripty languages that are resource hogs.
Some 8-bit parts cannot efficiently access data from a stack. Instead of using a stack to pass parameters, auto variables and parameters are statically allocated; typically, a linker allocates the automatic variables for main() at one end of memory, then allocates the variables for functions that are called by main and nothing else, then allocates the variables for functions that are called by those functions and nothing else, etc. This will yield an optimal allocation fairly easily, subject to some caveats:
Recursion can only be supported by adding code to explicitly copy variables onto some sort of stack arrangement; in many compilers, it's simply not supported at all.
If a function looks like it "might" call another function, the linker will assume it can do so in all cases (e.g. it may be that when 'foo' calls 'bar', one of its parameters might always have a value such that 'bar' won't call 'boz', but the linker won't know that).
Any call to a function pointer with a certain signature will be regarded as a call to all functions with the same signature whose address is taken.
If the evaluation of more than one parameter to a function requires making additional function calls, additional temporary storage must generally be pessimistically allocated even if optimal placement of the parameter storage could have avoided that.
There are many types of C programs for which the above restrictions pose no problem at all, and many more for which they pose a nuisance but not a huge one (e.g. by adding dummy parameters or return values to ensure different classes of indirectly-called functions have different signatures, as sketched below). Unfortunately, the code generated by a C++-to-C pre-compiler will almost always involve function pointers whose call graph cannot be reasonably divined, so using C++ on such a platform is apt to be difficult if not impossible.
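A hedged C sketch of that dummy-parameter workaround (all names are illustrative): giving the two classes of indirectly-called functions different signatures keeps the linker's call-graph analysis from assuming every indirect call might reach every callback.

    /* Two distinct callback signatures; the dummy parameter exists only to
       keep the timer callbacks out of the key callbacks' signature class. */
    typedef void (*timer_cb)(unsigned char dummy);
    typedef void (*key_cb)(void);

    static void blink_led(unsigned char dummy) { /* ... */ }
    static void beep(void)                     { /* ... */ }

    static timer_cb on_tick = blink_led;
    static key_cb   on_key  = beep;

    void timer_isr(void) { on_tick(0); } /* can only reach timer-signature functions */
    void key_isr(void)   { on_key();   } /* can only reach key-signature functions   */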
In relation to this question on Using OpenGL extensions, what's the purpose of these extension functions? Why would I want to use them? Further, are there any tradeoffs or gotchas associated with using them?
The OpenGL standard allows individual vendors to provide additional functionality through extensions as new technology is created. Extensions may introduce new functions and new constants, and may relax or remove restrictions on existing OpenGL functions.
Each vendor has an alphabetic abbreviation that is used in naming their new functions and constants. For example, NVIDIA's abbreviation (NV) is used in defining their proprietary function glCombinerParameterfvNV() and their constant GL_NORMAL_MAP_NV.
It may happen that more than one vendor agrees to implement the same extended functionality. In that case, the abbreviation EXT is used. It may further happen that the Architecture Review Board "blesses" the extension. It then becomes known as a standard extension, and the abbreviation ARB is used. The first ARB extension was GL_ARB_multitexture, introduced in version 1.2.1. Following the official extension promotion path, multitexturing is no longer an optionally implemented ARB extension, but has been a part of the OpenGL core API since version 1.3.
Before using an extension a program must first determine its availability, and then obtain pointers to any new functions the extension defines. The mechanism for doing this is platform-specific and libraries such as GLEW and GLEE exist to simplify the process.
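A minimal sketch of that check-then-use pattern with GLEW (assuming a current GL context and that glewInit() has already returned GLEW_OK):

    #include <GL/glew.h>

    bool setupVBOPath() {
        if (!GLEW_ARB_vertex_buffer_object)   // availability check
            return false;                     // fall back to plain vertex arrays

        GLuint vbo;
        glGenBuffersARB(1, &vbo);             // ARB-suffixed entry points loaded by GLEW
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
        return true;
    }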
Extensions are, in general, a way for graphics card vendors to add new functionality to OpenGL without having to wait until the next revision of the OpenGL spec. There are different types of extensions:
Vendor extension - only one vendor provides a certain type of functionality.
Example: NV_vertex_program
Multivendor extension - multiple vendors have gotten together and agreed on the functionality.
Example: EXT_vertex_program
ARB extension - the OpenGL Architecture Review Board has blessed the extension. You have a reasonable expectation that this type of extension will be around for a while.
Example: ARB_vertex_program
Extensions don't have to go through all of these steps. Sometimes an extension is only ever implemented by one vendor, before hardware designs go a different way and the extension is abandoned. Other times, an extension might make it as far as ARB status before everyone decides there's a better way. (The ARB_vertex_program approach, for instance, was set aside in favor of the high-level shading language approach of ARB_vertex_shader when it came time to roll shaders into the core OpenGL spec.) Even ARB extensions don't last forever; I wouldn't write something today requiring ARB_matrix_palette, for instance.
All of that having been said, it's a very good idea to keep up to date on extensions, in particular the latest ARB and EXT extensions. In the past it has been true that some of the 'fast paths' through the hardware were only accessible via extensions. Likewise, if you want to know all the functionality a piece of hardware can offer, there's no better place to look than in a vendor-specific extension.
If you're just getting started in OpenGL, I'd recommend investigating:
ARB_vertex_buffer_object (vertices)
ARB_vertex_shader / ARB_fragment_shader / ARB_shader_objects / GLSL spec (shaders)
More advanced:
ARB/EXT_framebuffer_object (off-screen rendering)
This is all functionality that's been rolled into core, but it can be good to see it in isolation so you can get a better feel for where its boundaries lie. (The core OpenGL spec seamlessly mixes the old with the new, so this can be pretty important if you want to stay on the fast path and avoid the legacy, and sometimes software-implemented, paths.)
Whatever you do, make sure you have appropriate checks for the extensions you decide to use, and fallbacks where necessary. Even though your card may have a given extension, there's no guarantee that the extension will be present on another vendor's card, or even on another operating system with the same card.
OpenGL extensions are new features added to the OpenGL specification; they are added by the OpenGL standards body and by the various graphics card vendors. These are exposed to the programmer as new function calls or constants. Every new version of the OpenGL specification ships with new functionality and (typically) includes all the previous functionality and extensions.
The real problem with OpenGL extensions exists only on Windows. Microsoft hasn't supported any extensions that have been released after OpenGL v1.1. The graphics card vendors overcome this by shipping their own version of this functionality through header files and libraries. However, using this can be a bit painful, as the question you linked to shows. But this problem has mostly gone away with the popularity of GLEW, which takes care of wrapping all this into an easy-to-use package.
If you do use a very recent OpenGL extension, be aware that it may not be supported on older graphics hardware. Other than this, there's no other disadvantage to using these extensions. Most of the extensions which become standard are pretty darn useful and there's very little logic to not use them.