How expensive are OpenGL operations? - opengl

I'm curious how expensive functions like:
glViewPort
glLoadIdentity
glOrtho
are in terms of both the work done on the CPU and the work done on the GPU.
Where is this documented?

This kind of thing is probably pretty dependent on your platform. Your best bet is probably to use a profiler yourself if you're worried about it.

As Alex O'Konski mentions, this is highly dependent on the platform.
That said, if you're interested in recent graphics cards of the PCs, You should know that most of them don't "do work" on the GPU. they set up state for future draw calls.
This is important because their cost is more related with how well the GPU can pipeline them between various draw calls that flow through the chip than how much time it takes to change a register from one value to the next.
Most platform vendors do not document at all what the costs of various state changes are. They don't document how OpenGL state maps to their hardware state, for that matter.
Last, state changes like matrix state (glLoadIdentity and glOrtho) are a remnant of the past. In modern graphics cards, they are simply helper (CPU) functions to set up uniforms (and this is why they are deprecated in core GL 3.1). And all the math they require (usually not much) is done on the CPU, inside the driver.

Related

Performance of WebGL and OpenGL

For the past month I've been messing with WebGL, and found that if I create and draw a large vertex buffer it causes low FPS. Does anyone know if it be the same if I used OpenGL with C++?
Is that a bottleneck with the language used (JavaScript in the case of WebGL) or the GPU?
WebGL examples like this show that you can draw 150,000 cubes using one buffer with good performance but anything more than this, I get FPS drops. Would that be the same with OpenGL, or would it be able to handle a larger buffer?
Basically, I've got to make a decision to continue using WebGL and try to optimise by code or - if you tell me OpenGL would perform better and it's a language speed bottleneck, switch to C++ and use OpenGL.
If you only have a single drawArrays call, there should not be much of a difference between OpenGL and WebGL for the call itself. However, setting up the data in Javascript might be a lot slower, so it really depends on your problem. If the bulk of your data is static (landscape, rooms), WebGL might work well for you. Otherwise, setting up the data in JS might be too slow for your purpose. It really depends on your problem.
p.s. If you include more details of what you are trying to do, you'll probably get more detailed / specific answers.
Anecdotally, I wrote a tile-based game in the early 2000's using the old glVertex() style API that ran perfectly smoothly. I recently started port it to WebGL and glDrawArrays() and now on my modern PC that is at least 10 times faster it gets terrible performance.
The reason seems to be that I was faking a call go glBegin(GL_QUADS); glVertex()*4; glEnd(); by using glDrawArrays(). Using glDrawArrays() to draw one polygon is much much slower in WebGL than doing the same with glVertex() was in C++.
I don't know why this is. Maybe it is because javascript is dog slow. Maybe it is because of some context switching issues in javascript. Anyway I can only do around 500 one-polygon glDrawArray() calls while still getting 60 FPS.
Everybody seems to work around this by doing as much on the GPU as possible, and doing as few glDrawArray() calls per frame as possible. Whether you can do this depends on what you are trying to draw. In the example of cubes you linked they can do everything on the GPU, including moving the cubes, which is why it is fast. Essentially they cheated - typically WebGL apps won't be like that.
Google had a talk where they explained this technique (they also unrealistically calculate the object motion on the GPU): https://www.youtube.com/watch?v=rfQ8rKGTVlg
OpenGL is more flexible and more optimized because of the newer versions of the api used.
It is true if you say that OpenGL is faster and more capable, but it also depends on your needs.
If you need one cube mesh with texture, webGL would be sufficient. However, if you intend building large-scale projects with lots of vertices, post-processing effects and different rendering techniques (and kind of displacement, parallax mapping, per-vertex or maybe tessellation) then OpenGL might be a better and wiser choice actually.
Optimizing buffers to a single call, optimizing update of those can be done, but it has its limits, of course, and yes, OpenGL would most likely perform better anyway.
To answer, it is not a language bottleneck, but an api-version-used one.
WebGL is based upon OpenGL ES, which has some pros but also runs a bit slower and it has more abstraction levels for code handling than pure OpenGL has, and that is reason for lowering performance - more code needs to be evaluated.
If your project doesn't require web-based solution, and doesn't care which devices are supported, then OpenGL would be a better and smarter choice.
Hope this helps.
WebGL is much slower on the same hardware compared to equivalent OpenGL, because of the high overheard for each WebGL call.
On desktop OpenGL, this overhead is at least limited, if still relatively expensive.
But in browsers like Chrome, WebGL requires that not only do we cross the FFI barrier to access those native OpenGL API calls (which still incur the same overhead), but we also have the cost of security checks to prevent the GPU being hijacked for computation.
If you are looking at something like glDraw* calls, which are called per frame, this means we are talking about perhaps (an) order(s) of magnitude fewer calls. All the more reason to opt for something like instancing, where the number of calls is drastically reduced.

OpenGL vs. OpenCL, which to choose and why?

What features make OpenCL unique to choose over OpenGL with GLSL for calculations? Despite the graphic related terminology and inpractical datatypes, is there any real caveat to OpenGL?
For example, parallel function evaluation can be done by rendering a to a texture using other textures. Reducing operations can be done by iteratively render to smaller and smaller textures. On the other hand, random write access is not possible in any efficient manner (the only way to do is rendering triangles by texture driven vertex data). Is this possible with OpenCL? What else is possible not possible with OpenGL?
OpenCL is created specifically for computing. When you do scientific computing using OpenGL you always have to think about how to map your computing problem to the graphics context (i.e. talk in terms of textures and geometric primitives like triangles etc.) in order to get your computation going.
In OpenCL you just formulate you computation with a calculation kernel on a memory buffer and you are good to go. This is actually a BIG win (saying that from a perspective of having thought through and implemented both variants).
The memory access patterns are though the same (your calculation still is happening on a GPU - but GPUs are getting more and more flexible these days).
But what else would you expect than using more than a dozen parallel "CPUs" without breaking your head about how to translate - e.g. (silly example) Fourier to Triangles and Quads...?
Something that hasn't been mentioned in any answers so far has been speed of execution. If your algorithm can be expressed in OpenGL graphics (e.g. no scattered writes, no local memory, no workgroups, etc.) it will very often run faster than an OpenCL counterpart. My specific experience of this has been doing image filter (gather) kernels across AMD, nVidia, IMG and Qualcomm GPUs. The OpenGL implementations invariably run faster even after hardcore OpenCL kernel optimization. (aside: I suspect this is due to years of hardware and drivers being specifically tuned to graphics orientated workloads.)
My advice would be that if your compute program feels like it maps nicely to the graphics domain then use OpenGL. If not, OpenCL is more general and simpler to express compute problems.
Another point to mention (or to ask) is whether you are writing as a hobbyist (i.e. for yourself) or commercially (i.e. for distribution to others). While OpenGL is supported pretty much everywhere, OpenCL is totally lacking support on mobile devices and, imho, is highly unlikely to appear on Android or iOS in the next few years. If wide cross platform compatibility from a single code base is a goal then OpenGL may be forced upon you.
What features make OpenCL unique to choose over OpenGL with GLSL for calculations? Despite the graphic related terminology and inpractical datatypes, is there any real caveat to OpenGL?
Yes: it's a graphics API. Therefore, everything you do in it has to be formulated along those terms. You have to package your data as some form of "rendering". You have to figure out how to deal with your data in terms of attributes, uniform buffers, and textures.
With OpenGL 4.3 and OpenGL ES 3.1 compute shaders, things become a bit more muddled. A compute shader is able to access memory via SSBOs/Image Load/Store in similar ways to OpenCL compute operations (though OpenCL offers actual pointers, while GLSL does not). Their interop with OpenGL is also much faster than OpenCL/GL interop.
Even so, compute shaders do not change one fact: OpenCL compute operations operate at a very different precision than OpenGL's compute shaders. GLSL's floating-point precision requirements are not very strict, and OpenGL ES's are even less strict. So if floating-point accuracy is important to your calculations, OpenGL will not be the most effective way of computing what you need to compute.
Also, OpenGL compute shaders require 4.x-capable hardware, while OpenCL can run on much more inferior hardware.
Furthermore, if you're doing compute by co-opting the rendering pipeline, OpenGL drivers will still assume that you're doing rendering. So it's going to make optimization decisions based on that assumption. It will optimize the assignment of shader resources assuming you're drawing a picture.
For example, if you're rendering to a floating-point framebuffer, the driver might just decide to give you an R11_G11_B10 framebuffer, because it detects that you aren't doing anything with the alpha and your algorithm could tolerate the lower precision. If you use image load/store instead of a framebuffer however, you're much less likely to get this effect.
OpenCL is not a graphics API; it's a computation API.
Also, OpenCL just gives you access to more stuff. It gives you access to memory levels that are implicit with regard to GL. Certain memory can be shared between threads, but separate shader instances in GL are unable to directly affect one-another (outside of Image Load/Store, but OpenCL runs on hardware that doesn't have access to that).
OpenGL hides what the hardware is doing behind an abstraction. OpenCL exposes you to almost exactly what's going on.
You can use OpenGL to do arbitrary computations. But you don't want to; not while there's a perfectly viable alternative. Compute in OpenGL lives to service the graphics pipeline.
The only reason to pick OpenGL for any kind of non-rendering compute operation is to support hardware that can't run OpenCL. At the present time, this includes a lot of mobile hardware.
One notable feature would be scattered writes, another would be the absence of "Windows 7 smartness". Windows 7 will, as you probably know, kill the display driver if OpenGL does not flush for 2 seconds or so (don't nail me down on the exact time, but I think it's 2 secs). This may be annoying if you have a lengthy operation.
Also, OpenCL obviously works with a much greater variety of hardware than just the graphics card, and it does not have a rigid graphics-oriented pipeline with "artificial constraints". It is easier (trivial) to run several concurrent command streams too.
Although currently OpenGL would be the better choice for graphics, this is not permanent.
It could be practical for OpenGL to eventually merge as an extension of OpenCL. The two platforms are about 80% the same, but have different syntax quirks, different nomenclature for roughly the same components of the hardware. That means two languages to learn, two APIs to figure out. Graphics driver developers would prefer a merge because they no longer would have to develop for two separate platforms. That leaves more time and resources for driver debugging. ;)
Another thing to consider is that the origins of OpenGL and OpenCL are different: OpenGL began and gained momentum during the early fixed-pipeline-over-a-network days and was slowly appended and deprecated as the technology evolved. OpenCL, in some ways, is an evolution of OpenGL in the sense that OpenGL started being used for numerical processing as the (unplanned) flexibility of GPUs allowed so. "Graphics vs. Computing" is really more of a semantic argument. In both cases you're always trying to map your math operations to hardware with the highest performance possible. There are parts of GPU hardware which vanilla CL won't use but that won't keep a separate extension from doing so.
So how could OpenGL work under CL? Speculatively, triangle rasterizers could be enqueued as a special CL task. Special GLSL functions could be implemented in vanilla OpenCL, then overridden to hardware accelerated instructions by the driver during kernel compilation. Writing a shader in OpenCL, pending the library extensions were supplied, doesn't sound like a painful experience at all.
To call one to have more features than the other doesn't make much sense as they're both gaining 80% the same features, just under different nomenclature. To claim that OpenCL is not good for graphics because it is designed for computing doesn't make sense because graphics processing is computing.
Another major reason is that OpenGL\GLSL are supported only on graphics cards. Although multi-core usage started with using graphics hardware there are many hardware vendors working on multi-core hardware platform targeted for computation. For example see Intels Knights Corner.
Developing code for computation using OpenGL\GLSL will prevent you from using any hardware that is not a graphics card.
Well as of OpenGL 4.5 these are the features OpenCL 2.0 has that OpenGL 4.5 Doesn't (as far as I could tell) (this does not cover the features that OpenGL has that OpenCL doesn't):
Events
Better Atomics
Blocks
Workgroup Functions:
work_group_all and work_group_any
work_group_broadcast:
work_group_reduce
work_group_inclusive/exclusive_scan
Enqueue Kernel from Kernel
Pointers (though if you are executing on the GPU this probably doesn't matter)
A few math functions that OpenGL doesn't have (though you could construct them yourself in OpenGL)
Shared Virtual Memory
(More) Compiler Options for Kernels
Easy to select a particular GPU (or otherwise)
Can run on the CPU when no GPU
More support for those niche hardware platforms (e.g. FGPAs)
On some (all?) platforms you do not need a window (and its context binding) to do calculations.
OpenCL allows just a bit more control over precision of calculations (including some through those compiler options).
A lot of the above are mostly for better CPU - GPU interaction: Events, Shared Virtual Memory, Pointers (although these could potentially benefit other stuff too).
OpenGL has gained the ability to sort things into different areas of Client and Server memory since a lot of the other posts here have been made.
OpenGL has better memory barrier and atomics support now and allows you to allocate things to different registers within the GPU (to about the same degree OpenCL can). For example you can share registers in the local compute group now in OpenGL (using something like the AMD GPUs LDS (local data share) (though this particular feature only works with OpenGL compute shaders at this time).
OpenGL has stronger more performing implementations on some platforms (such as Open Source Linux drivers).
OpenGL has access to more fixed function hardware (like other answers have said). While it is true that sometimes fixed function hardware can be avoided (e.g. Crytek uses a "software" implementation of a depth buffer) fixed function hardware can manage memory just fine (and usually a lot better than someone who isn't working for a GPU hardware company could) and is just vastly superior in most cases. I must admit OpenCL has pretty good fixed function texture support which is one of the major OpenGL fixed function areas.
I would argue that Intels Knights Corner is a x86 GPU that controls itself.
I would also argue that OpenCL 2.0 with its texture functions (which are actually in lesser versions of OpenCL) can be used to much the same performance degree user2746401 suggested.
In addition to the already existing answers, OpenCL/CUDA not only fits more to the computational domain, but also doesn't abstract away the underlying hardware too much. This way you can profit from things like shared memory or coalesced memory access more directly, which would otherwise be burried in the actual implementation of the shader (which itself is nothing more than a special OpenCL/CUDA kernel, if you want).
Though to profit from such things you also need to be a bit more aware of the specific hardware your kernel will run on, but don't try to explicitly take those things into account using a shader (if even completely possible).
Once you do something more complex than simple level 1 BLAS routines, you will surely appreciate the flexibility and genericity of OpenCL/CUDA.
The "feature" that OpenCL is designed for general-purpose computation, while OpenGL is for graphics. You can do anything in GL (it is Turing-complete) but then you are driving in a nail using the handle of the screwdriver as a hammer.
Also, OpenCL can run not just on GPUs, but also on CPUs and various dedicated accelerators.
OpenCL (in 2.0 version) describes heterogeneous computational environment, where every component of system can both produce & consume tasks, generated by other system components. No more CPU, GPU (etc) notions are longer needed - you have just Host & Device(s).
OpenGL, in opposite, has strict division to CPU, which is task producer & GPU, which is task consumer. That's not bad, as less flexibility ensures greater performance. OpenGL is just more narrow-scope instrument.
One thought is to write your program in both and test them with respect to your priorities.
For example: If you're processing a pipeline of images, maybe your implementation in openGL or openCL is faster than the other.
Good luck.

How are Direct3D and OpenGL instructions handled in a graphics card?

I am trying to understand better how GPUs work, and I am confused about how they handled high level APIs like Direct3D or OpenGL. It is very common to see graphic cards advertising they support Direct3D and OpenGL hardware acceleration. Does this mean that they handle Direct3D and OpenGL instructions directly in hardware?
I haven't been able to find clear evidence to this, or to them being compiled to an assembly representation that the GPU can handle. If there is such a conversion who does that? The software library (Direct3D/OpenGL), the driver or the GPU itself?
On that same line, where is the graphics pipeline defined? in the gpu hardware, the driver, or the software library? This confuses me specially with the idea of programmable pipelines.
Is there a good resource where I can find information about these details?
You have asked a very broad and complicated question. Actually, you have asked several broad, complicated questions.
The software that has final governance over the operation of any hardware is called the hardware's "driver". Naturally, for graphics hardware, this is called the "graphics driver." Like all drivers, the graphics driver is effectively an installable part of the OS; the OS is what allows the graphics driver to do its job and talk to the hardware. The two work hand in hand.
There are effectively two kinds of D3D or OpenGL (heretofore known as "the API") calls: those that talk to the driver and those that do not. Every call that actually draws something needs to (eventually) talk to the driver, but calls that set up later drawing calls may just store data locally.
When you make a drawing call, the API does some checks to make sure that you as the user have made a valid rendering call. If so, the API has some options as to what to do. It turns out that talking directly to the driver takes a long time, regardless of how many commands you give it when you start talking. Therefore, what often happens is that the API stores your rendering call and returns immediately. Then, possibly in another thread, it may look to see how many rendering calls have been stored. If there are "enough", then it will forward them to the driver. This is called "marshalling".
The driver's job is to take these calls that have been forwarded and convert them into stuff that the GPU will do.
On that same line, where is the graphics pipeline defined? in the gpu hardware, the driver, or the software library?
That's actually a pretty tricky question these days, and becoming trickier every hardware generation.
In the old days, the construction of the graphics pipeline was rigidly controlled by the GPU hardware. These days, this is less true, though there is some hardware control. On modern hardware (capable of OpenGL 3.0 or Direct3D10 or better), it would be theoretically possible, if you had direct access to the graphics driver, to design an API that used a somewhat altered version of the graphics pipeline. So the APIs dictate much of what the graphics pipeline looks like.
Each stage in the rendering pipeline takes certain values from the precious stage(s) as input and generates some number of values as output. A stage is "programmable" if the mechanism for generating the outputs from the inputs involves executing a user-supplied program, called a "shader". So there is no such thing as a programmable pipeline (yet); just programmable stages of a fixed pipeline.
There's no such thing as D3D or OGL instructions. Direct3D or OpenGL will call into the graphics driver and they will perform whatever they need to do to make it happen. This is not completely true of shaders, which do have a uniform bytecode at the API (D3D/OGL) level, and in this case, the API provides a compiler, but those are, as far as I know, still transformed in hardware-dependent ways before being executed. Of course, Direct3D and OpenGL also include user-mode components to improve performance or provide a better interface- for example, they will batch calls to the kernel to reduce context switches.
The reality of GPU making is that Microsoft and nVidia/ATi get together and think about what they want and what's feasible to implement, and come up with a group specification, as the reality is that none of this would work if the major hardware and software vendors didn't co-operate. Nobody will buy a GPU that doesn't support DirectX- and nobody will buy Windows where no GPU implements DirectX. Of course, "nobody" is relative- but it would be a huge loss for all concerned, and of course, if you have a game that is built to only the D3D10 API, then the driver supporting D3D10 is a must to run the game- effectively increasing the value of the product by increasing the range of software it can run, which is a selling point. This means that the semantic difference between being defined by the hardware vendor or software vendor is minimal, realistically- especially as the only two real 3D rendering API's on the PC, OpenGL and Direct3D, follow very similar models for the graphical pipeline, as far as I know.
However, with the new programmable GPUs, you could argue that the graphical pipeline doesn't really exist- a DX11 device can be used for any graphics pipeline you can conceive of, if you have the patience to program it.
Ultimately, the GPU is protected by a strong driver-level abstraction. It implements a C-style interface, and whatever's permitted or necessary in that implementation goes. Everything after that is completely implementation-defined.
You could check out the MSDN documentation for writing a graphics driver. I've seen it, but don't have a link handy, and it describes the interfaces that you must adhere to and other things.
You already got two very good answers. But maybe the best thing is, reading the actual programming documentation for AMD/ATI's GPUs: http://developer.amd.com/documentation/guides/pages/default.aspx#open_gpu
Unfortunately NVidia won't publish theirs.

Which parts of Graphics Pipelines are done using CPU & GPU?

Which parts of pipelines are done using CPU and which are done using GPU?
Reading Wikipedia on Graphics Pipeline, maybe my question does not precisely represent what I am asking.
Referring to this question, which "steps" are done in CPU and which are done in GPU?
Edit:
My question is more into which parts of logical high level steps needed to display terrain+3D models [from files] are using CPU/GPU instead of which functions.
Which parts of pipelines are done using CPU and which are done using GPU?
As far as I know...
It depends entirely on the hardware, driver and opengl implementation.
Everything you can do in OpenGL, works, but you can't be 100% sure where processing happens - it is conveniently hidden from you (it is actually one of OpenGL benefits).
There are full-software OpenGL emulators, for example - mesa3d (non-certified).
Put such opengl32.dll (if you're on windows) into your program folder, and everything will be running on CPU.
Typically, rasterization (last stage of pipeline - alpha-blend, putting pixel, ztest) is done in hardware. Get old videocard, and it is possible that the rest of functionality will run on CPU. Vertex transformations, vertex manipulations, and even standard lighting can be done on CPU (and they were done on CPU for a long time) - depending on videocard. However, I must admit that that videocards without hardware TL (transform and lighting) support are quite old. AFAIK, transformation and lighting is always performed on CPU if card only supports full hardware acceleration of DirectX 7 (windows-platform specific).
There may be platform dependent way to get videocard capabilities in openGL, though..
Side note:
DirectX gives you more information about card capabilities, and a ways to control what happens where (i.e. you can choose software T&L even if videocard supports hardware acceleration - in some cases this may be necessarry). This comes with penalty - DirectX is less flexible than OpenGL, harder to use, harder to experiment with, it is available on less platforms than OpenGL, and worrying about which feature is supported and which isn't takes development time.
The answer depends on what GPU (since they differ in HW capabilities, specially in integrated ones), but on modern non-integrated GPUs, the entire pipeline is implemented in HW. This includes culling, z-sort and occlusion.
Integrated chipsets, such as Intel's, have less HW features, for example, they usually do tranformation & lighting using software.
It depends a lot on the system configuration, at least the CPU has to issue the commands to the GPU (send vertices lists, upload textures, etc.), and the GPU usually does all the geometric transformations, texture mappings, lighting, etc.
But in many cases, part of the work that the GPU should be doing might be done by the CPU if the hardware doesn't support certain features.
If you're using a 3D engine, it might also do some pre-optimizations on the CPU to alleviate the number of triangles that the GPU has to process, like discarding beforehand parts of the geometry that are known to be hidden (by using BSP's like IdTech engines or Portals like Unreal Engine, for example).
Things like moving and rotating the meshes of the models (e.g. the walking animations) are usually done in the CPU, but nowadays I guess the trend would be to HW accelerate them too.
Regarding the wikipedia link, all the steps mentioned there are usually done in the GPU, but is up to the software running in the CPU to decide what polygons are going to be sent for processing (even if they are discarded by the GPU later on because they're hidden or out of the viewing frustrum).

Which OpenGL functions are not GPU-accelerated?

I was shocked when I read this (from the OpenGL wiki):
glTranslate, glRotate, glScale
Are these hardware accelerated?
No, there are no known GPUs that
execute this. The driver computes the
matrix on the CPU and uploads it to
the GPU.
All the other matrix operations are
done on the CPU as well :
glPushMatrix, glPopMatrix,
glLoadIdentity, glFrustum, glOrtho.
This is the reason why these functions
are considered deprecated in GL 3.0.
You should have your own math library,
build your own matrix, upload your
matrix to the shader.
For a very, very long time I thought most of the OpenGL functions use the GPU to do computation. I'm not sure if this is a common misconception, but after a while of thinking, this makes sense. Old OpenGL functions (2.x and older) are really not suitable for real-world applications, due to too many state switches.
This makes me realise that, possibly, many OpenGL functions do not use the GPU at all.
So, the question is:
Which OpenGL function calls don't use the GPU?
I believe knowing the answer to the above question would help me become a better programmer with OpenGL. Please do share some of your insights.
Edit:
I know this question easily leads to optimisation level. It's good, but it's not the intention of this question.
If anyone knows a set of GL functions on a certain popular implementation (as AshleysBrain suggested, nVidia/ATI, and possibly OS-dependent) that don't use the GPU, that's what I'm after!
Plausible optimisation guides come later. Let's focus on the functions, for this topic.
Edit2:
This topic isn't about how matrix transformations work. There are other topics for that.
Boy, is this a big subject.
First, I'll start with the obvious: Since you're calling the function (any function) from the CPU, it has to run at least partly on the CPU. So the question really is, how much of the work is done on the CPU and how much on the GPU.
Second, in order for the GPU to get to execute some command, the CPU has to prepare a command description to pass down. The minimal set here is a command token describing what to do, as well as the data for the operation to be executed. How the CPU triggers the GPU to do the command is also somewhat important. Since most of the time, this is expensive, the CPU does not do it often, but rather batches commands in command buffers, and simply sends a whole buffer for the GPU to handle.
All this to say that passing work down to the GPU is not a free exercise. That cost has to be pitted against just running the function on the CPU (no matter what we're talking about).
Taking a step back, you have to ask yourself why you need a GPU at all. The fact is, a pure CPU implementation does the job (as AshleysBrain mentions). The power of the GPU comes from its design to handle:
specialized tasks (rasterization, blending, texture filtering, blitting, ...)
heavily parallel workloads (DeadMG is pointing to that in his answer), when a CPU is more designed to handle single-threaded work.
And those are the guiding principles to follow in order to decide what goes in the chip. Anything that can benefit from those ought to run on the GPU. Anything else ought to be on the CPU.
It's interesting, by the way. Some functionality of the GL (prior to deprecation, mostly) are really not clearly delineated. Display lists are probably the best example of such a feature. Each driver is free to push as much as it wants from the display list stream to the GPU (typically in some command buffer form) for later execution, as long as the semantics of the GL display lists are kept (and that is somewhat hard in general). So some implementations only choose to push a limited subset of the calls in a display list to a computed format, and choose to simply replay the rest of the command stream on the CPU.
Selection is another one where it's unclear whether there is value to executing on the GPU.
Lastly, I have to say that in general, there is little correlation between the API calls and the amount of work on either the CPU or the GPU. A state setting API tends to only modify a structure somewhere in the driver data. It's effect is only visible when a Draw, or some such, is called.
A lot of the GL API works like that. At that point, asking whether glEnable(GL_BLEND) is executed on the CPU or GPU is rather meaningless. What matters is whether the blending will happen on the GPU when Draw is called. So, in that sense, Most GL entry points are not accelerated at all.
I could also expand a bit on data transfer but Danvil touched on it.
I'll finish with the little "s/w path". Historically, GL had to work to spec no matter what the hardware special cases were. Which meant that if the h/w was not handling a specific GL feature, then it had to emulate it, or implement it fully in software. There are numerous cases of this, but one that struck a lot of people is when GLSL started to show up.
Since there was no practical way to estimate the code size of a GLSL shader, it was decided that the GL was supposed to take any shader length as valid. The implication was fairly clear: either implement h/w that could take arbitrary length shaders -not realistic at the time-, or implement a s/w shader emulation (or, as some vendors chose to, simply fail to be compliant). So, if you triggered this condition on a fragment shader, chances were the whole of your GL ended up being executed on the CPU, even when you had a GPU siting idle, at least for that draw.
The question should perhaps be "What functions eat an unexpectedly high amount of CPU time?"
Keeping a matrix stack for projection and view is not a thing the GPU can handle better than a CPU would (on the contrary ...). Another example would be shader compilation. Why should this run on the GPU? There is a parser, a compiler, ..., which are just normal CPU programs like the C++ compiler.
Potentially "dangerous" function calls are for example glReadPixels, because data can be copied from host (=CPU) memory to device (=GPU) memory over the limited bus. In this category are also functions like glTexImage_D or glBufferData.
So generally speaking, if you want to know how much CPU time an OpenGL call eats, try to understand its functionality. And beware of all functions, which copy data from host to device and back!
Typically, if an operation is per-something, it will occur on the GPU. An example is the actual transformation - this is done once per vertex. On the other hand, if it occurs only once per large operation, it'll be on the CPU - such as creating the transformation matrix, which is only done once for each time the object's state changes, or once per frame.
That's just a general answer and some functionality will occur the other way around - as well as being implementation dependent. However, typically, it shouldn't matter to you, the programmer. As long as you allow the GPU plenty of time to do it's work while you're off doing the game sim or whatever, or have a solid threading model, you shouldn't need to worry about it that much.
#sending data to GPU: As far as I know (only used Direct3D) it's all done in-shader, that's what shaders are for.
glTranslate, glRotate and glScale change the current active transformation matrix. This is of course a CPU operation. The model view and projection matrices just describes how the GPU should transforms vertices when issue a rendering command.
So e.g. by calling glTranslate nothing is translated at all yet. Before rendering the current projection and model view matrices are multiplied (MVP = projection * modelview) then this single matrix is copied to the GPU and then the GPU does the matrix * vertex multiplications ("T&L") for each vertex. So the translation/scaling/projection of the vertices is done by the GPU.
Also you really should not be worried about the performance if you don't use these functions in an inner loop somewhere. glTranslate results in three additions. glScale and glRotate are a bit more complex.
My advice is that you should learn a bit more about linear algebra. This is essential for working with 3D APIs.
There are software rendered implementations of OpenGL, so it's possible that no OpenGL functions run on the GPU. There's also hardware that doesn't support certain render states in hardware, so if you set a certain state, switch to software rendering, and again, nothing will run on the GPU (even though there's one there). So I don't think there's any clear distinction between 'GPU-accelerated functions' and 'non-GPU accelerated functions'.
To be on the safe side, keep things as simple as possible. The straightforward rendering-with-vertices and basic features like Z buffering are most likely to be hardware accelerated, so if you can stick to that with the minimum state changing, you'll be most likely to keep things hardware accelerated. This is also the way to maximize performance of hardware-accelerated rendering - graphics cards like to stay in one state and just crunch a bunch of vertices.