How to synchronize data for OpenGL compute shader soft body simulation - opengl

I'm trying to make a soft body physics simulation, using OpenGL compute shaders. I'm using a spring/mass model, where objects are modeled as being made out of a mesh of particles, connected by springs (Wikipedia link with more details). My plan is to have a big SSBO that stores the positions, velocities, and net forces for each particle. I'll have a compute shader that, for each spring, calculates the force between the particles on both ends of that spring (using Hook's law) and adds that to the net forces for those two particles. Then I'll have another compute shader that, for each particle, does some sort of Euler integration using the data from the SSBO, and then zeros the net forces for the next frame.
My problem is with memory synchronization. Each particle is attached to more than one spring, so in the first compute shader, different invocations will be adding to the same location in memory (the one holding the net force). The spring calculations don't use data from that variable, so the writes can take place in whatever order, but I'm unfamiliar with how OpenGL memory works, and I'm not sure how to avoid race conditions. In addition, from what I've read it seems like I'll need glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) between the calls to the spring and particle compute shaders, so that the data written by the former is visible to latter. Is this necessary/sufficient?
I'm afraid to experiment, because I'm not sure what protections OpenGL gives against undefined behavior and accidentally screwing up your computer.

different invocations will be adding to the same location in memory (the one holding the net force). The spring calculations don't use data from that variable, so the writes can take place in whatever order
In this case you would need to use atomicAdd() in GLSL to make sure two separate threads don't get into a race condition.
In your case I don't think this will be a performance issue, but you should be aware that atomicAdd() can cause a big slowdown in cases where many threads are hitting the same location in memory at the same time (they have to serialize and wait for eachother). This performance issue is called "contention", and depending on the problem, you can usually improve it a lot by using warp-level primitives to make sure only 1 thread within each warp needs to actually commit the atomicAdd() (or other atomic operation).
Also "warps" are Nvidia terminology, AMD calls them "wavefronts", and there are different names still on other hardware vendors and API's.
In addition, from what I've read it seems like I'll need glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT)
This is correct. Conceptually the way I think about it is that OpenGL compute shaders are async by default. This means, when you launch a compute shader, there's no guarantee when it will execute relative to subsequent commands. glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) will basically create a wait() between any draw/compute commands accessing that type of resource.

Related

What is GPU driven rendering? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
Nowadays I'm hearing from different places about the so called GPU driven rendering which is a new paradigm of rendering which doesn't require draw calls at all, and that it is supported by the new versions of OpenGL and Vulkan APIs. Can someone explain how it actually works on conceptual level and what are the main differences with the traditional approach?
Overview
In order to render a scene, a number of things have to happen. You need to walk your scene graph to figure out which objects exist. For each object which exists, you now need to determine if it is visible. For each object which is visible, you need to figure out where its geometry is stored, which textures and buffers will be used to render that object, which shaders to use to render the object, and so forth. Then you render that object.
The "traditional" method handling this is for the CPU to handle this process. The scene graph lives in CPU-accessible memory. The CPU does visibility culling on that scene graph. The CPU takes the visible objects and access some CPU data about the geometry (OpenGL buffer object and texture names, Vulkan descriptor sets and VkBuffers, etc), shaders, etc, transferring this as state data to the GPU. Then the CPU issues a GPU command to render that object with that state.
Now, if we go back farther, the most "traditional" method doesn't involve a GPU at all. The CPU would just take this mesh and texture data, do vertex transformations, rasterizatization, and so forth, producing an image in CPU memory. However, we started off-loading some of this to a separate processor. We started with the rasterization stuff (the earliest graphics chips were just rasterizers; the CPU did all the vertex T&L). Then we incorporated the vertex transformations into the GPU. When we did that, we started having to store vertex data in GPU accessible memory so the GPU could read it on its own time.
We did all of that, off-loading these things to a separate processor for two reasons: the GPU was (much) faster at it, and the CPU can now spend its time doing something else.
GPU driven rendering is just the next stage in that process. We went from no GPU, to rasterization GPU, to vertex GPU, and now to scene-graph-level GPU. The "traditional" method offloads how to render to the GPU; GPU driven rendering offloads the decision of what to render.
Mechanism
Now, the reason we haven't been doing this all along is because the basic rendering commands all take data that comes from the CPU. glDrawArrays/Elements takes a number of parameters from the CPU. So even if we used the GPU to generate that data, we would need a full GPU/CPU synchronization so that the CPU could read the data... and give it right back to the GPU.
That's not helpful.
OpenGL 4 gave us indirect rendering of various forms. The basic idea is that, instead of taking those parameters from a function call, they're just data stored in GPU memory. The CPU still has to make a function call to start the rendering operation, but the actual parameters to that call are just data stored in GPU memory.
The other half of that requires the ability of the GPU to write data to GPU memory in a format that indirect rendering can read. Historically, data on GPUs goes in one direction: data gets read for the purpose of being converted into pixels in a render target. We need a way to generate semi-arbitrary data from other arbitrary data, all on the GPU.
The older mechanism for this was to (ab)use transform feedback for this purpose, but nowadays we just use SSBOs or failing that, image load/store. Compute shaders help here as well, since they are designed to be outside of the standard rendering pipeline and therefore are not bound to its limitations.
The ideal form of GPU-driven rendering makes the scene-graph part of the rendering operation. There are lesser forms, such as having the GPU do nothing more than per-object viewport culling. But let's look at the most ideal process. From the perspective of the CPU, this looks like:
Update the scene graph in GPU memory.
Issue one or more compute shaders that generate multi-draw indirect rendering commands.
Issue a single multi-draw indirect call that draws everything.
Now of course, there's no such thing as a free lunch. Doing full scene graph processing on the GPU requires building your scene graph in a way that is efficient for GPU processing. Even more importantly, visibility culling mechanisms have to be engineered with efficient GPU processing in mind. That's complexity I'm not going to address here.
Implementation
Instead, let's look at the nuts-and-bolts of making the drawing part work. We have to sort out a lot of things here.
See, the indirect rendering command is still a regular old rendering command. While the multi-draw form draws multiple distinct "objects", it's still one CPU rendering command. This means that, for the duration of this command, all rendering state is fixed.
So everything under the purview of this multi-draw operation must use the same shader, bound buffers&textures, blending parameters, stencil state, and so forth. This makes implementing a GPU-driven rendering operation a bit complicated.
State and Shaders
If you need blending, or similar state-based differences in rendering operations, then you are going to have to issue another rendering command. So in the blending case, your scene-graph processing is going to have to compute multiple sets of rendering commands, with each set being for a specific set of blending modes. You may also need to have this system sort transparent objects (unless you're rendering them with an OIT mechanism). So instead of having just one rendering command, you have a small number of them.
But the point of this exercise however isn't to have only one rendering command; the point is that the number of CPU rendering commands does not change with regard to how much stuff you're rendering. It shouldn't matter how many objects are in the scene; the CPU will be issuing the same number of rendering commands.
When it comes to shaders, this technique requires some degree of "ubershader" style: where you have a very few number of rather flexible shaders. You want to parameterize your shader rather than having dozens or hundreds of them.
However things were probably going to fall out that way anyway, particularly with regard to deferred rendering. The geometry pass of deferred renderers tends to use the same kind of processing, since they're just doing vertex transformation and extracting material parameters. The biggest difference usually is with regard to doing skinned vs. non-skinned rendering, but that's really only 2 shader variations. Which you can handle similarly to the blending case.
Speaking of deferred rendering, the GPU driven processes can also walk the graph of lights, thus generating the draw calls and rendering data for the lighting passes. So while the lighting pass will need a separate draw call, it will still only need a single multidraw call regardless of the number of lights.
Buffers
Here's where things start to get interesting. See, if the GPU is processing the scene graph, that means that the GPU needs to somehow associate a particular draw within the multi-draw command with the resources that particular draw needs. It may also need to put the data into those resources, like the matrix transforms for a given object and so forth.
Oh, and you also somehow need to tie the vertex input data to a particular sub-draw.
That last part is probably the most complicated. The buffers which OpenGL/Vulkan's standard vertex input method pull from are state data; they cannot change between sub-draws of a multi-draw operation.
Your best bet is to try to put every object's data in the same buffer object, using the same vertex format. Essentially, you have one gigantic array of vertex data. You can then use the drawing parameters for the sub-draw to select which parts of the buffer(s) to use.
But what do we do about per-object data (matrices, etc), things you would typically use a UBO or global uniform for? How do you effectively change the buffer binding state within a CPU rendering command?
Well... you can't. So you cheat.
First, you realize that SSBOs can be arbitrarily large. So you don't really need to change buffer binding state. What you need is a single SSBO that contains everyone's per-object data. For each vertex, the VS simply needs to pick out the correct data for that sub-draw from the giant list of data.
This is done via a special vertex shader input: gl_DrawID. When you issue a multi-draw command, the VS gets an input value that represents the index of this sub-draw operation within the multidraw command. So you can use gl_DrawID to index into a table of per-object data to fetch the appropriate data for that particular object.
This also means that the compute shader which generates this sub-draw also needs use the index of that sub-draw to define where in the array to put the per-object data for that sub-draw. So the CS that writes a sub-draw also needs to be responsible for setting up the per-object data that matches the sub-draw.
Textures
OpenGL and Vulkan have pretty strict limits on the number of textures that can be bound. Well actually those limits are quite large relative to traditional rendering, but in GPU driven rendering land, we need a single CPU rendering call to potentially access any texture. That's harder.
Now, we do have gl_DrawID; coupled with the table mentioned above, we can retrieve per-object data. So: how do we convert this to a texture?
There are multiple ways. We could put a bunch of our 2D textures into an array texture. We can then use gl_DrawID to fetch an array index from our SSBO of per-object data; that array index becomes the array layer we use to fetch "our" texture. Note that we don't use gl_DrawID directly because multiple different sub-draws could use the same texture, and because the GPU code that sets up the array of draw calls does not control the order in which textures appear in our array.
Array textures have obvious downsides, the most notable of which is that we must respect the limitations of an array texture. All elements in the array must use the same image format. They must all be of the same size. Also, there are limits on the number of array layers in an array texture, so you might encounter them.
The alternatives to array textures differ along API lines, though they basically boil down to the same thing: convert a number into a texture.
In OpenGL land, you can employ bindless texturing (for hardware that supports it). This system provides a mechanism that allows one to generate a 64-bit integer handle which represents a particular texture, pass this handle to the GPU (since it is just an integer, use whatever mechanism you want), and then convert this 64-bit handle into a sampler type. So you use gl_DrawID to fetch a 64-bit handle from the per-object data, then convert that into a sampler of the appropriate type and use it.
In Vulkan land, you can employ sampler arrays (for hardware that supports it). Note that these are not array textures; in GLSL, these are sampler types which are arrayed: uniform sampler2D my_2D_textures[6000];. In OpenGL, this would be a compile error because each array element represents a distinct bind point for a texture, and you cannot have 6000 distinct bind points. In Vulkan, an arrayed sampler only represents a single descriptor, no matter how many elements are in that array. Vulkan implementations have limits on how many elements there can be in such arrays, but hardware that supports the feature you need to employ this (shaderSampledImageArrayDynamicIndexing) will typically offer a generous limit.
So your shader uses gl_DrawID to get an index from the per-object data. The index is turned into a sampler by just fetching the value from the sampler array. The only limitation for textures in that arrayed descriptor is that they must all be of the same type and basic data format (floating-point 2D for sampler2D, unsigned integer cubemap for usamplerCube, etc). The specifics of formats, texture sizes, mipmap counts, and the like are all irrelevant.
And if you're concerned about the cost difference of Vulkan's array of samplers compared to OpenGL's bindless, don't be; implementations of bindless are just doing this behind your back anyway.

Comparing the multiDrawArrays, using primitive restart and multiDrawElements in terms of performance?

I want to draw a mass of branches with different shapes, each of which consisting of 4 triangle strips. (Using OpenGL)
So now I'm considering using one of those method calls (multiDrawArrays, using primitive restart and multiDrawElements).
I was wondering which one is more efficient. Is the method multiDrawArrays() equivalent to several drawArrays() in terms of speed?
Does VAO store the vertex info in the RAM while the SSBO store those in the VRAM? If so, is it better to use SSBO rather than VAO considering the performance?
As #derhass already pointed out in a comment, some of your terminology is mixed up. A VAO (Vertex Array Object) contains state that defines how vertex data is associated with vertex attributes. It's the VBO (Vertex Buffer Object) that contains the actual vertex data.
I doubt that using a SSBO for vertex data would be beneficial. In general, the buffer types primarily define how the data is used. The graphics pipeline is tailored towards fetching vertex data from VBOs, and many GPUs have dedicated fixed function hardware to pull data from VBOs and feed it into the vertex shader. I can't see how using explicit code in the vertex shader to pull the vertex data from a SSBO instead would be more efficient.
Whether the data is stored in VRAM or SRAM is a different consideration. The only control you have over that is with the last argument to glBufferData(). It provides a hint on how you plan to use the data. For example, if you specify GL_STATIC_DRAW, you're telling the driver that you're not planning to modify the data, which suggests that placing it in VRAM might be a good idea. Whether it will actually be in VRAM is then up to the driver, and it may decide that based on various criteria.
Functionally, glMultiDrawArrays() is equivalent to multiple calls to glDrawArrays(). But it can certainly be more efficient. If nothing else, it saves the overhead of making multiple API calls. Each API call has a certain amount of overhead, for example:
It might pass through a couple of software layers, resulting in additional function calls under the hood.
It needs to get the current context from thread local storage.
It needs to do error checking.
It may need some form of locking to deal with access from multiple contexts in multiple threads (might not be needed for a draw call).
It needs to check for pending state changes.
The MultiDraw calls were introduced to cut down on the number of API calls needed.
Now, whether glMultiDrawArrays() or glDrawElements() with primitive restart is more efficient, that's impossible to say in general. If you're not already using an index buffer, I would be a bit hesitant to introduce one just so that you can use primitive restart. So my instinct would be:
Use glDrawElements() with primitive restart if you're using an index buffer anyway.
Use glMultiDrawArrays() if you're not using an index buffer.
The real answer, as always, can only be obtained by benchmarking. And it can of course be platform/hardware dependent. My prediction is that you will not see a significant difference in most cases, since both of these allow you to draw a lot of your geometry with a single API call, which should avoid bottlenecks in this area.

Should Particle systems be updated entirely in the geometry shader

Should Particle systems be updated entirely in the geometry shader or should the geometry shader be passed updated data for positions and life ect. At the moment i update everything in the geometry but iam not sure if this is the best idea incase some of the data is needed in the C++.
It's possible to almost everything in shaders ( especially if you're going for SM4+ ). I don't recommend going for anything over SM3 if you want any sort of market penetration. I still regret we didn't provide a SM2 fallback for our latest game, because quite a few people still use old crappy SM2 cards.
More on to the question. You can use RTT and never to do a round trip back to the main memory ( this is slow as hell, minimize transfer from graphics memory to main memory ), but the down side is that you need to use some rather elaborate tricks to compute AABBs ( which you'll want on the CPU side of things ) if you go pure GPU.
Instead we do everything which requires changing the state of a particle on the CPU side. We then have a tight memory representation of that data which gets updated to GPU. The vertex shader is rather meaty ( but that's totally fine, do as much as you possibly can in the vertex shader! ), it extracts this compressed representation of a particle, transforms it, and passes the uncompressed data on to the pixel shader. An important observation here is that you can, and should, split per vertex & per particle data. This implies using instancing ( which is just a way of saying: use frequency dividers ). We represent particle rotation with a normal + rotation about that normal.
Another reason for doing the state changes of a particle CPU side is that it's a heck of a lot easier to composite behavior CPU side. Any at least half decent particle system needs quite a bit of knobs to turn to be able to create interesting particle effects.
EDIT: And if you have anything resembling Particle::Update that can't be inlined you've failed, minimize per particle function calls, especially virtual ones, and keep the memory representation of a particle tightly packed!
That depends on what kind of particle system you have. In most cases you have a software representation in C++ and a hardware representation for your shader. The geometry data for the shader is computed from the software representation and should be as little as possible. Because in most cases not the computational power is the limiting resource but the transfer rate to the graphics card.
If you can decrease transfers even further with your method, you can still keep a software representation in memory for further usage. Even if this implies computing data twice, it can be faster than the transfer process.

Organizing GLSL shaders in OpenGL engine

Which is better ?
To have one shader program with a lot of uniforms specifying
lights to use, or mappings to do (e.g. I need one mesh to be parallax mapped, and another one parallax/specular mapped). I'd make a cached list of uniforms for lazy transfers, and just change a couple of uniforms for every next mesh if it needs to do so.
To have a lot of shader programs for every needed case, each one with small amount of uniforms, and do the lazy bind with glUseProgram for every mesh if it needs to do so. Here I assume that meshes are properly batched, to avoid redundant switches.
Most modern engines I know have a "shader cache" and use the second option, because apparently it's faster.
Also you can take a look at the ARB_shader_subroutine which allows dynamic linkage. But I think it's only available on DX11 class hardware.
Generally, option 2 will be faster/better unless you have a truly huge number of programs. You can also use buffer objects shared across programs so that you need not reset any values when you change programs.
In addition, once you link a program, you can free all of the shaders that you linked into the program. This will free up all the source code and any pre-link info the driver is keeping around, leaving just the fully-linked program in memory.
I would tend to believe that it depends on the specific application. And yes since it would be more efficient to say run 100 programs where they each may have about 2-16 uniforms each; it may be better to have a trade off of the two. I would tend to think that having say maybe 10 - 20 programs for your most common shading techniques would be sufficient or a few more. For example you might want to have one program / shader to do all your bump mapping, one to do all of your fog effects, one to do reflections, one to do refractions.
Now outside the scope of your question I think it would pertain here as well, one thing to incorporate into your engine would be a BatchProcess & BatchManager class setup to reduce the amount of CPU - GPU calls over the bus as this would prove efficient as well. I don't think there is a 1 fits all solution to your question as I would believe that it would be application specific just as setting up the relationship between how many batches (buckets) of vertices (primitives) your engine would have and how many vertices each of those batches would contain.
To try to make this a bit more clear: one game might have 4 containers or batches where each batch can hold up to 10,000 vertices to be considered to be full before the BatchManager decides to empty that bucket sending all of those vertices to the Graphics Card for the Rendering pipeline to be processed and drawn where a different game may have 10 buckets with 5,000 vertices, or another game might have 8 buckets with 12,0000 vertices.
So there could be a trade off of trying to combine the two according to your needs. If you have 1 single program with 100s of uniforms; the single program is easier to manage within the pipeline, but the shaders would be over cumbersome to read and manage. Then again have shaders with very few uniforms is quite easy to read and manage but having 100s of programs is a little harder to manage on the CPU before linking and sending them to be rendered properly. I would personally try to find a middle ground to where I have enough programs to do each specific task that is completely unique from each other such as doing fog density on one and a volumetric shadow mapping on another where each program has just enough uniforms to do the calculations required.
The next step would then be to do some bench mark testing to see where you efficiency and your overhead are balanced to make the appropriate adjustments.

Which OpenGL functions are not GPU-accelerated?

I was shocked when I read this (from the OpenGL wiki):
glTranslate, glRotate, glScale
Are these hardware accelerated?
No, there are no known GPUs that
execute this. The driver computes the
matrix on the CPU and uploads it to
the GPU.
All the other matrix operations are
done on the CPU as well :
glPushMatrix, glPopMatrix,
glLoadIdentity, glFrustum, glOrtho.
This is the reason why these functions
are considered deprecated in GL 3.0.
You should have your own math library,
build your own matrix, upload your
matrix to the shader.
For a very, very long time I thought most of the OpenGL functions use the GPU to do computation. I'm not sure if this is a common misconception, but after a while of thinking, this makes sense. Old OpenGL functions (2.x and older) are really not suitable for real-world applications, due to too many state switches.
This makes me realise that, possibly, many OpenGL functions do not use the GPU at all.
So, the question is:
Which OpenGL function calls don't use the GPU?
I believe knowing the answer to the above question would help me become a better programmer with OpenGL. Please do share some of your insights.
Edit:
I know this question easily leads to optimisation level. It's good, but it's not the intention of this question.
If anyone knows a set of GL functions on a certain popular implementation (as AshleysBrain suggested, nVidia/ATI, and possibly OS-dependent) that don't use the GPU, that's what I'm after!
Plausible optimisation guides come later. Let's focus on the functions, for this topic.
Edit2:
This topic isn't about how matrix transformations work. There are other topics for that.
Boy, is this a big subject.
First, I'll start with the obvious: Since you're calling the function (any function) from the CPU, it has to run at least partly on the CPU. So the question really is, how much of the work is done on the CPU and how much on the GPU.
Second, in order for the GPU to get to execute some command, the CPU has to prepare a command description to pass down. The minimal set here is a command token describing what to do, as well as the data for the operation to be executed. How the CPU triggers the GPU to do the command is also somewhat important. Since most of the time, this is expensive, the CPU does not do it often, but rather batches commands in command buffers, and simply sends a whole buffer for the GPU to handle.
All this to say that passing work down to the GPU is not a free exercise. That cost has to be pitted against just running the function on the CPU (no matter what we're talking about).
Taking a step back, you have to ask yourself why you need a GPU at all. The fact is, a pure CPU implementation does the job (as AshleysBrain mentions). The power of the GPU comes from its design to handle:
specialized tasks (rasterization, blending, texture filtering, blitting, ...)
heavily parallel workloads (DeadMG is pointing to that in his answer), when a CPU is more designed to handle single-threaded work.
And those are the guiding principles to follow in order to decide what goes in the chip. Anything that can benefit from those ought to run on the GPU. Anything else ought to be on the CPU.
It's interesting, by the way. Some functionality of the GL (prior to deprecation, mostly) are really not clearly delineated. Display lists are probably the best example of such a feature. Each driver is free to push as much as it wants from the display list stream to the GPU (typically in some command buffer form) for later execution, as long as the semantics of the GL display lists are kept (and that is somewhat hard in general). So some implementations only choose to push a limited subset of the calls in a display list to a computed format, and choose to simply replay the rest of the command stream on the CPU.
Selection is another one where it's unclear whether there is value to executing on the GPU.
Lastly, I have to say that in general, there is little correlation between the API calls and the amount of work on either the CPU or the GPU. A state setting API tends to only modify a structure somewhere in the driver data. It's effect is only visible when a Draw, or some such, is called.
A lot of the GL API works like that. At that point, asking whether glEnable(GL_BLEND) is executed on the CPU or GPU is rather meaningless. What matters is whether the blending will happen on the GPU when Draw is called. So, in that sense, Most GL entry points are not accelerated at all.
I could also expand a bit on data transfer but Danvil touched on it.
I'll finish with the little "s/w path". Historically, GL had to work to spec no matter what the hardware special cases were. Which meant that if the h/w was not handling a specific GL feature, then it had to emulate it, or implement it fully in software. There are numerous cases of this, but one that struck a lot of people is when GLSL started to show up.
Since there was no practical way to estimate the code size of a GLSL shader, it was decided that the GL was supposed to take any shader length as valid. The implication was fairly clear: either implement h/w that could take arbitrary length shaders -not realistic at the time-, or implement a s/w shader emulation (or, as some vendors chose to, simply fail to be compliant). So, if you triggered this condition on a fragment shader, chances were the whole of your GL ended up being executed on the CPU, even when you had a GPU siting idle, at least for that draw.
The question should perhaps be "What functions eat an unexpectedly high amount of CPU time?"
Keeping a matrix stack for projection and view is not a thing the GPU can handle better than a CPU would (on the contrary ...). Another example would be shader compilation. Why should this run on the GPU? There is a parser, a compiler, ..., which are just normal CPU programs like the C++ compiler.
Potentially "dangerous" function calls are for example glReadPixels, because data can be copied from host (=CPU) memory to device (=GPU) memory over the limited bus. In this category are also functions like glTexImage_D or glBufferData.
So generally speaking, if you want to know how much CPU time an OpenGL call eats, try to understand its functionality. And beware of all functions, which copy data from host to device and back!
Typically, if an operation is per-something, it will occur on the GPU. An example is the actual transformation - this is done once per vertex. On the other hand, if it occurs only once per large operation, it'll be on the CPU - such as creating the transformation matrix, which is only done once for each time the object's state changes, or once per frame.
That's just a general answer and some functionality will occur the other way around - as well as being implementation dependent. However, typically, it shouldn't matter to you, the programmer. As long as you allow the GPU plenty of time to do it's work while you're off doing the game sim or whatever, or have a solid threading model, you shouldn't need to worry about it that much.
#sending data to GPU: As far as I know (only used Direct3D) it's all done in-shader, that's what shaders are for.
glTranslate, glRotate and glScale change the current active transformation matrix. This is of course a CPU operation. The model view and projection matrices just describes how the GPU should transforms vertices when issue a rendering command.
So e.g. by calling glTranslate nothing is translated at all yet. Before rendering the current projection and model view matrices are multiplied (MVP = projection * modelview) then this single matrix is copied to the GPU and then the GPU does the matrix * vertex multiplications ("T&L") for each vertex. So the translation/scaling/projection of the vertices is done by the GPU.
Also you really should not be worried about the performance if you don't use these functions in an inner loop somewhere. glTranslate results in three additions. glScale and glRotate are a bit more complex.
My advice is that you should learn a bit more about linear algebra. This is essential for working with 3D APIs.
There are software rendered implementations of OpenGL, so it's possible that no OpenGL functions run on the GPU. There's also hardware that doesn't support certain render states in hardware, so if you set a certain state, switch to software rendering, and again, nothing will run on the GPU (even though there's one there). So I don't think there's any clear distinction between 'GPU-accelerated functions' and 'non-GPU accelerated functions'.
To be on the safe side, keep things as simple as possible. The straightforward rendering-with-vertices and basic features like Z buffering are most likely to be hardware accelerated, so if you can stick to that with the minimum state changing, you'll be most likely to keep things hardware accelerated. This is also the way to maximize performance of hardware-accelerated rendering - graphics cards like to stay in one state and just crunch a bunch of vertices.