Confused about OpenGL transformations - opengl

In opengl there is one world coordinate system with origin (0,0,0).
What confuses me is what all the transformations like glTranslate, glRotate, etc. do? Do they move
objects in world coordinates, or do they move the camera? As you know, the same movement can be achieved by either moving objects or camera.
I am guessing that glTranslate, glRotate, change objects, and gluLookAt changes the camera?

In opengl there is one world coordinate system with origin (0,0,0).
Well, technically no.
What confuses me is what all the transformations like glTranslate, glRotate, etc. do? Do they move objects in world coordinates, or do they move the camera?
Neither. OpenGL doesn't know objects, OpenGL doesn't know a camera, OpenGL doesn't know a world. All that OpenGL cares about are primitives, points, lines or triangles, per vertex attributes, normalized device coordinates (NDC) and a viewport, to which the NDC are mapped to.
When you tell OpenGL to draw a primitive, each vertex is processed according to its attributes. The position is one of the attributes and usually a vector with 1 to 4 scalar elements within local "object" coordinate system. The task at hand is to somehow transform the local vertex position attribute into a position on the viewport. In modern OpenGL this happens within a small program, running on the GPU, called a vertex shader. The vertex shader may process the position in an arbitrary way. But the usual approach is by applying a number of nonsingular, linear transformations.
Such transformations can be expressed in terms of homogenous transformation matrices. For a 3 dimensional vector, the homogenous representation in a vector with 4 elements, where the 4th element is 1.
In computer graphics a 3-fold transformation pipeline has become sort of the standard way of doing things. First the object local coordinates are transformed into coordinates relative to the virtual "eye", hence into eye space. In OpenGL this transformation used to be called the modelview transformaion. With the vertex positions in eye space several calculations, like illumination can be expressed in a generalized way, hence those calculations happen in eye space. Next the eye space coordinates are tranformed into the so called clip space. This transformation maps some volume in eye space to a specific volume with certain boundaries, to which the geometry is clipped. Since this transformation effectively applies a projection, in OpenGL this used to be called the projection transformation.
After clip space the positions get "normalized" by their homogenous component, yielding normalized device coordinates, which are then plainly mapped to the viewport.
To recapitulate:
A vertex position is transformed from local to clip space by
vpos_eye = MV · vpos_local
eyespace_calculations(vpos_eye);
vpos_clip = P · vpos_eye
·: inner product column on row vector
Then to reach NDC
vpos_ndc = vpos_clip / vpos_clip.w
and finally to the viewport (NDC coordinates are in the range [-1, 1]
vpos_viewport = (vpos_ndc + (1,1,1,1)) * (viewport.width, viewport.height) / 2 + (viewport.x, viewport.y)
*: vector component wise multiplication
The OpenGL functions glRotate, glTranslate, glScale, glMatrixMode merely manipulate the transformation matrices. OpenGL used to have four transformation matrices:
modelview
projection
texture
color
On which of them the matrix manipulation functions act on can be set using glMatrixMode. Each of the matrix manipulating functions composes a new matrix by multiplying the transformation matrix they describe on top of the select matrix thereby replacing it. The functions glLoadIdentity replace the current matrix with identity, glLoadMatrix replaces it with a user defined matrix, and glMultMatrix multiplies a user defined matrix on top of it.
So how does the modelview matrix then emulate both object placement and a camera. Well, as you already stated
As you know, the same movement can be achieved by either moving objects or camera.
You can not really discern between them. The usual approach is by splitting the object local to eye transformation into two steps:
Object to world – OpenGL calls this the "model transform"
World to eye – OpenGL calls this the "view transform"
Together they form the model-view, in fixed function OpenGL described by the modelview matrix. Now since the order of transformations is
local to world, Model matrix vpos_world = M · vpos_local
world to eye, View matrix vpos_eye = V · vpos_world
we can substitute by
vpos_eye = V · ( M · vpos_local ) = V · M · vpos_local
replacing V · M by the ModelView matrix =: MV
vpos_eye = MV · vpos_local
Thus you can see that what's V and what's M of the compund matrix M is only determined by the order of operations in which you multiply onto the modelview matrix, and at which step you decide to "call it the model transform from here on".
I.e. right after a
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
the view is defined. But at some point you'll start applying model transformations and everything after is model.
Note that in modern OpenGL all the matrix manipulation functions have been removed. OpenGL's matrix stack never was feature complete and no serious application did actually use it. Most programs just glLoadMatrix-ed their self calculated matrices and didn't bother with the OpenGL built-in matrix maniupulation routines.
And ever since shaders were introduced, the whole OpenGL matrix stack got awkward to use, to say it nicely.
The verdict: If you plan on using OpenGL the modern way, don't bother with the built-in functions. But keep in mind what I wrote, because what your shaders do will be very similar to what OpenGL's fixed function pipeline did.

OpenGL is a low-level API, there is no higher-level concepts like an "object" and a "camera" in the "scene", so there are only two matrix modes: MODELVIEW (a multiplication of "camera" matrix by the "object" transformation) and PROJECTION (the projective transformation from world-space to post-perspective space).
Distinction between "Model" and "View" (object and camera) matrices is up to you. glRotate/glTranslate functions just multiply the currently selected matrix by the given one (without even distinguishing between ModelView and Projection).

Those functions multiply (transform) the current matrix set by glMatrixMode() so it depends on the matrix you're working on. OpenGL has 4 different types of matrices; GL_MODELVIEW, GL_PROJECTION, GL_TEXTURE, and GL_COLOR, any one of those functions can change any of those matrices. So, basically, you don't transform objects you just manipulate different matrices to "fake" that effect.
Note that glulookat() is just a convenient function equivalent to a translation followed by some rotations, there's nothing special about it.

All transformations are transformations on objects. Even gluLookAt is just a transformation to transform the objects as if the camera was where you tell it to be. Technically they are transformations on the vertices, but that's just semantics.

That's true, glTranslate, glRotate change the object coordinates before rendering and gluLookAt changes the camera coordinate.

Related

Why do modeview and camera matrices use RUB orientation

I usually find matrix libraries building both modelview and cameras matrices from the RUB (right-up-back) vectors, as depicted in these pages:
http://3dengine.org/Right-up-back_from_modelview
http://3dengine.org/Modelview_matrix
Is the RUB tuple just a common standard?
Otherwise, is there a reason the RUB vectors are preferred over any other orientation (such as forward-up-right)?
Particularly if you're using the programmable pipeline, you have almost complete freedom about the coordinate system you work in, and how you transform your geometry. But once all your transformations are applied in the vertex shader (resulting in the vector assigned to gl_Position), there is still a fixed function block in the pipeline between the vertex shader and fragment shader. That fixed function block relies on the transformed vertices being in a well defined coordinate system.
gl_Position is in a coordinate system called "clip coordinates", which then turns into "normalized device coordinates" (NDC) after dividing by the w coordinate of the vector.
Based on the vector in NDC, the fixed function rasterization block generates pixels. It will use the first coordinate to map to the horizontal window direction, and the second coordinate to map to the vertical window direction. The third coordinate will be used to calculate the depth, which can be used for depth testing.
This means that after all transformations are applied, the first coordinate has to be left-right, the second coordinate has to be bottom-up, and the third coordinate has to be front-back (well, it could be back-front if you change the depth test).
If you use a classic setup with modelview and projection matrix, it makes sense to use the modelview matrix to transform the original geometry into this orientation, and then use the projection matrix to apply e.g. a perspective.
I don't think there's anything stopping you from using a different orientation as the result of the modelview transformation, and then include a rotation in the projection matrix to transform the whole thing into the correct clip coordinate space. But I don't see a benefit, and it looks like it would just add unnecessary confusion.

Questions about Orthogonal vs Perspective in OpenGL [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
I've gotten a 3 vertices triangle rotating around the y-axis. One of the things I find "weird" is normalized coordinates in GL's default orthogonal projection. I've used 2-D libraries like SDL and SFML, which almost always deal with pixels. You say you want an image surface that is 50x50 pixels, that is what you get. So initially it was strange for me to say limit my vertex position choices from [-1,1].
Why do orthogonal coordinates have to be normalized? Is perspective projection the same? If so, how would you say you want the origin of your object to be at z=-10? (My quick look over matrix m ath says perspective is different. Something about division by 'w' creating homogenous (same thing as normalized?) coordinates, but I'm not sure).
gl_Position = View * Model * Project * Vertex;
I've seen that equation above and I'm boggled by how the variable gl_Position used in shaders can represent both the position of the current vertices of a model/object, and at the same time a view/projection, or the position of the camera. How does that work? I understand by multiplication all that information is stored in one matrix, but how does OpenGL use that one matrix whose information is now combined to say, "ok, this part/fraction of gl_Position is for the camera, and that other part is information for where the model is going to go."? (BTW, I'm not quite sure what the Vertex vec4 represents. I thought all vertices of a model were inside Model. Any ideas?
One more question, if you just wanted to move the camera, for example in FPS games you move the mouse up to look up, but no objects are being rotated or translated (I think) other than the camera, would the equation above look something like this?
gl_Position = View * Project;
Why do orthogonal coordinates have to be normalized?
They don't. You can set the limits of the orthogonal projection volume however you desire. The left, right, bottom, top, near and far parameters of the glOrtho call define the limits of the viewport volume. If you chose them left=0, right=win_pixel_width, bottom=0, top=win_pixel_height you end up with a pixel unit projection volume as you're used to. However why bother with pixels? You'd just have to compensate for the actual window size later. Just choose the ortho projection volume extents to match the scene you want to draw.
Maybe you're confusing this with normalized device coordinates. And for those it simply has been defined that it is the value range [-1, 1] that's mapped to the viewport extents.
Update
BTW, I'm not quite sure what the Vertex vec4 represents. I thought all vertices of a model were inside Model. Any ideas?
I'm getting quite fatigued right now, because I've been answering several questions like this numerous times over the last few days. So, here it goes again:
In OpenGL there is no camera.
In OpenGL there is no scene.
In OpenGL there are no models.
"Wait, what?!" you may wonder now. But it's true.
All OpenGL cares about is, that there is some target framebuffer, i.e. a canvas it can draw to, and a stream of vertex attributes that make geometric primitives. The primitives are points, lines and triangles. Somehow the vertex attributes for, say a triangle, must be mapped to a position on the framebuffer canvas. For this an vertex attribute we call position goes through a number of affine transformations.
The first is from a local model space into world space , the Model transform.
From world space into eye space, the View transform. It is this view transform, which acts like placing the camera in a scene.
After that it put through the equivalent of a camera's lens, which is a Projection transform.
After the Projection transform the position is in clip space where it undergoes some operations, that are not essential to understand for the time being. After clipping the so called homogenous divide is applied to reach normalized device coordinate space, by dividing the clip space position vector by its own w-component.
v_position_ndc = v_position_clip / v_position_clip.w
This step is, what makes a perspective projection actually work. The z-distance of a vertex' position is worked into the clip space w-component. And by the homogenous divide vertices with a larger position w get scaled proportionally to 1/w in the XY plane which creates a perspective effect.
You mistook this operation as normalization, but it is not!
After the homogenous divide vertex position has been mapped from clip to NDC space. And OpenGL defines, that the visible volume of NDC space is the box [-1, 1]^3 ; vertices outside this box are clipped.
It's crucial to understand that View transform and Projection are different. For a position it's not so obvious, but another vertex attribute called the normal, which is an important ingredient for lighting calculations, must be transformed in a slightly different way (instead of Projection · View · Model it must be transformed by inverse(transpose(View · Model)), i.e. the Projection takes no part in it but the viewpoint does).
The matrices itself are 4×4 grids of real valued scalars (ignore for the time being that numbers in a computer are always rational numbers). So the rank of the matrix is 4 and hence it must be multiplied of vectors of dimension 4 (hence the type vec4)
OpenGL treats vertex attributes as column vectors so matrix multiplication is left associative i.e. a vector enters an expression on the right side and comes out on the left. The order of matrix multiplication matters. You can not freely reorder things!
The statement
gl_Position = Projection * View * Model * vertex_position; // note the order
makes the vertex shader perform this very transformation process I just described.
"Note that there is no separate camera (view) matrix in OpenGL. Therefore, in order to simulate transforming the camera or view, the scene (3D objects and lights) must be transformed with the inverse of the view transformation. In other words, OpenGL defines that the camera is always located at (0, 0, 0) and facing to -Z axis in the eye space coordinates, and cannot be transformed. See more details of GL_MODELVIEW matrix in ModelView Matrix."
Source: http://www.songho.ca/opengl/gl_transform.html
That is what I was getting hung up on. I thought there would a separate camera view matrix in OpenGL.

Why would it be beneficial to have a separate projection matrix, yet combine model and view matrix?

When you are learning 3D programming, you are taught that it's easiest think in terms of 3 transformation matrices:
The Model Matrix. This matrix is individual to every single model and it rotates and scales the object as desired and finally moves it to its final position within your 3D world. "The Model Matrix transforms model coordinates to world coordinates".
The View Matrix. This matrix is usually the same for a large number of objects (if not for all of them) and it rotates and moves all objects according to the current "camera position". If you imaging that the 3D scene is filmed by a camera and what is rendered on the screen are the images that were captured by this camera, the location of the camera and its viewing direction define which parts of the scene are visible and how the objects appear on the captured image. There are little reasons for changing the view matrix while rendering a single frame, but those do in fact exists (e.g. by rendering the scene twice and changing the view matrix in between, you can create a very simple, yet impressive mirror within your scene). Usually the view matrix changes only once between two frames being drawn. "The View Matrix transforms world coordinates to eye coordinates".
The Projection Matrix. The projection matrix decides how those 3D coordinates are mapped to 2D coordinates, e.g. if there is a perspective applied to them (objects get smaller the farther they are away from the viewer) or not (orthogonal projection). The projection matrix hardly ever changes at all. It may have to change if you are rendering into a window and the window size has changed or if you are rendering full screen and the resolution has changed, however only if the new window size/screen resolution has a different display aspect ratio than before. There are some crazy effects for that you may want to change this matrix but in most cases its pretty much constant for the whole live of your program. "The Projection Matrix transforms eye coordinates to screen coordinates".
This makes all a lot of sense to me. Of course one could always combine all three matrices into a single one, since multiplying a vector first by matrix A and then by matrix B is the same as multiplying the vector by matrix C, where C = B * A.
Now if you look at the classical OpenGL (OpenGL 1.x/2.x), OpenGL knows a projection matrix. Yet OpenGL does not offer a model or a view matrix, it only offers a combined model-view matrix. Why? This design forces you to permanently save and restore the "view matrix" since it will get "destroyed" by model transformations applied to it. Why aren't there three separate matrices?
If you look at the new OpenGL versions (OpenGL 3.x/4.x) and you don't use the classical render pipeline but customize everything with shaders (GLSL), there are no matrices available any longer at all, you have to define your own matrices. Still most people keep the old concept of a projection matrix and a model-view matrix. Why would you do that? Why not using either three matrices, which means you don't have to permanently save and restore the model-view matrix or you use a single combined model-view-projection (MVP) matrix, which saves you a matrix multiplication in your vertex shader for ever single vertex rendered (after all such a multiplication doesn't come for free either).
So to summarize my question: Which advantage has a combined model-view matrix together with a separate projection matrix over having three separate matrices or a single MVP matrix?
Look at it practically. First, the fewer matrices you send, the fewer matrices you have to multiply with positions/normals/etc. And therefore, the faster your vertex shaders.
So point 1: fewer matrices is better.
However, there are certain things you probably need to do. Unless you're doing 2D rendering or some simple 3D demo-applications, you are going to need to do lighting. This typically means that you're going to need to transform positions and normals into either world or camera (view) space, then do some lighting operations on them (either in the vertex shader or the fragment shader).
You can't do that if you only go from model space to projection space. You cannot do lighting in post-projection space, because that space is non-linear. The math becomes much more complicated.
So, point 2: You need at least one stop between model and projection.
So we need at least 2 matrices. Why model-to-camera rather than model-to-world? Because working in world space in shaders is a bad idea. You can encounter numerical precision problems related to translations that are distant from the origin. Whereas, if you worked in camera space, you wouldn't encounter those problems, because nothing is too far from the camera (and if it is, it should probably be outside the far depth plane).
Therefore: we use camera space as the intermediate space for lighting.
In most cases your shader will need the geometry in world or eye coordinates for shading so you have to seperate the projection matrix from the model and view matrices.
Making your shader multiply the geometry with two matrices hurts performance. Assuming each model have thousends (or more) vertices it is more efficient to compute a model view matrix in the cpu once, and let the shader do one less mtrix-vector multiplication.
I have just solved a z-buffer fighting problem by separating the projection matrix. There is no visible increase of the GPU load. The two folowing screenshots shows the two results - pay attention to the green and white layers fighting.

OpenGL define vertex position in pixels

I've been writing a 2D basic game engine in OpenGL/C++ and learning everything as I go along. I'm still rather confused about defining vertices and their "position". That is, I'm still trying to understand the vertex-to-pixels conversion mechanism of OpenGL. Can it be explained briefly or can someone point to an article or something that'll explain this. Thanks!
This is rather basic knowledge that your favourite OpenGL learning resource should teach you as one of the first things. But anyway the standard OpenGL pipeline is as follows:
The vertex position is transformed from object-space (local to some object) into world-space (in respect to some global coordinate system). This transformation specifies where your object (to which the vertices belong) is located in the world
Now the world-space position is transformed into camera/view-space. This transformation is determined by the position and orientation of the virtual camera by which you see the scene. In OpenGL these two transformations are actually combined into one, the modelview matrix, which directly transforms your vertices from object-space to view-space.
Next the projection transformation is applied. Whereas the modelview transformation should consist only of affine transformations (rotation, translation, scaling), the projection transformation can be a perspective one, which basically distorts the objects to realize a real perspective view (with farther away objects being smaller). But in your case of a 2D view it will probably be an orthographic projection, that does nothing more than a translation and scaling. This transformation is represented in OpenGL by the projection matrix.
After these 3 (or 2) transformations (and then following perspective division by the w component, which actually realizes the perspective distortion, if any) what you have are normalized device coordinates. This means after these transformations the coordinates of the visible objects should be in the range [-1,1]. Everything outside this range is clipped away.
In a final step the viewport transformation is applied and the coordinates are transformed from the [-1,1] range into the [0,w]x[0,h]x[0,1] cube (assuming a glViewport(0, w, 0, h) call), which are the vertex' final positions in the framebuffer and therefore its pixel coordinates.
When using a vertex shader, steps 1 to 3 are actually done in the shader and can therefore be done in any way you like, but usually one conforms to this standard modelview -> projection pipeline, too.
The main thing to keep in mind is, that after the modelview and projection transforms every vertex with coordinates outside the [-1,1] range will be clipped away. So the [-1,1]-box determines your visible scene after these two transformations.
So from your question I assume you want to use a 2D coordinate system with units of pixels for your vertex coordinates and transformations? In this case this is best done by using glOrtho(0.0, w, 0.0, h, -1.0, 1.0) with w and h being the dimensions of your viewport. This basically counters the viewport transformation and therefore transforms your vertices from the [0,w]x[0,h]x[-1,1]-box into the [-1,1]-box, which the viewport transformation then transforms back to the [0,w]x[0,h]x[0,1]-box.
These have been quite general explanations without mentioning that the actual transformations are done by matrix-vector-multiplications and without talking about homogenous coordinates, but they should have explained the essentials. This documentation of gluProject might also give you some insight, as it actually models the transformation pipeline for a single vertex. But in this documentation they actually forgot to mention the division by the w component (v" = v' / v'(3)) after the v' = P x M x v step.
EDIT: Don't forget to look at the first link in epatel's answer, which explains the transformation pipeline a bit more practical and detailed.
It is called transformation.
Vertices are set in 3D coordinates which is transformed into a viewport coordinates (into your window view). This transformation can be set in various ways. Orthogonal transformation can be easiest to understand as a starter.
http://www.songho.ca/opengl/gl_transform.html
http://www.opengl.org/wiki/Vertex_Transformation
http://www.falloutsoftware.com/tutorials/gl/gl5.htm
Firstly be aware that OpenGL not uses standard pixel coordinates. I mean by that for particular resolution, ie. 800x600 you dont have horizontal coordinates in range 0-799 or 1-800 stepped by one. You rather have coordinates ranged from -1 to 1 later send to graphic card rasterizing unit and after that matched to particular resolution.
I ommited one step here - before all that you have an ModelViewProjection matrix (or viewProjection matrix in some simple cases) which before all that will cast coordinates you use to an projection plane. Default use of that is to implement a camera which converts 3D space of world (View for placing an camera into right position and Projection for casting 3d coordinates into screen plane. In ModelViewProjection it's also step of placing a model into right place in world).
Another case (and you can use Projection matrix this way to achieve what you want) is to use these matrixes to convert one range of resolutions to another.
And there's a trick you will need. You should read about modelViewProjection matrix and camera in openGL if you want to go serious. But for now I will tell you that with proper matrix you can just cast your own coordinate system (and ie. use ranges 0-799 horizontaly and 0-599 verticaly) to standarized -1:1 range. That way you will not see that underlying openGL api uses his own -1 to 1 system.
The easiest way to achieve this is glOrtho function. Here's the link to documentation:
http://www.opengl.org/sdk/docs/man/xhtml/glOrtho.xml
This is example of proper usage:
glMatrixMode (GL_PROJECTION)
glLoadIdentity ();
glOrtho (0, 800, 600, 0, 0, 1)
glMatrixMode (GL_MODELVIEW)
Now you can use own modelView matrix ie. for translation (moving) objects but don't touch your projection example. This code should be executed before any drawing commands. (Can be after initializing opengl in fact if you wont use 3d graphics).
And here's working example: http://nehe.gamedev.net/tutorial/2d_texture_font/18002/
Just draw your figures instead of drawing text. And there is another thing - glPushMatrix and glPopMatrix for choosen matrix (in this example projection matrix) - you wont use that until you combining 3d with 2d rendering.
And you can still use model matrix (ie. for placing tiles somewhere in world) and view matrix (in example for zooming view, or scrolling through world - in this case your world can be larger than resolution and you could crop view by simple translations)
After looking at my answer I see it's a little chaotic but If you confused - just read about Model, View, and Projection matixes and try example with glOrtho. If you're still confused feel free to ask.
MSDN has a great explanation. It may be in terms of DirectX but OpenGL is more-or-less the same.
Google for "opengl rendering pipeline". The first five articles all provide good expositions.
The key transition from vertices to pixels (actually, fragments, but you won't be too far off if you think "pixels") is in the rasterization stage, which occurs after all vertices have been transformed from world-coordinates to screen coordinates and clipped.

How do I translate single objects in OpenGL 3.x?

I have a bit of experience writing OpenGL 2 applications and want to learn using OpenGL 3. For this I've bought the Addison Wesley "Red-book" and "Orange-book" (GLSL) which descirbe the deprecation of the fixed functionality and the new programmable pipeline (shaders). But what I can't get a grasp of is how to construct a scene with multiple objects without using the deprecated translate*, rotate* and scale* functions.
What I used to do in OGL2 was to "move about" in 3D space using the translate and rotate functions, and create the objects in local coordinates where I wanted them using glBegin ... glEnd. In OGL3 these functions are all deprecated, and, as I understand, replaced by shaders. But I can't call a shaderprogram for each and every object I make, can I? Wouldn't this affect all the other objects too?
I'm not sure if I've explained my problem satisfactory, but the core of it is how to program a scene with multiple objects defined in local coordinates in OpenGL 3.1. All the beginner tutorials I've found only uses a single object and doesn't have/solve this problem.
Edit: Imagine you want two spinning cubes. It would be a pain manually modifying each vertex coordinate, and you can't simply modify the modelview-matrix, because that would rather spin the camera around two static cubes...
Let's start with the basics.
Usually, you want to transform your local triangle vertices through the following steps:
local-space coords-> world-space coords -> view-space coords -> clip-space coords
In standard GL, the first 2 transforms are done through GL_MODELVIEW_MATRIX, the 3rd is done through GL_PROJECTION_MATRIX
These model-view transformations, for the many interesting transforms that we usually want to apply (say, translate, scale and rotate, for example), happen to be expressible as vector-matrix multiplication when we represent vertices in homogeneous coordinates. Typically, the vertex V = (x, y, z) is represented in this system as (x, y, z, 1).
Ok. Say we want to transform a vertex V_local through a translation, then a rotation, then a translation. Each transform can be represented as a matrix*, let's call them T1, R1, T2.
We want to apply the transform to each vertex: V_view = V_local * T1 * R1 * T2. Matrix multiplication being associative, we can compute once and for all M = T1 * R1 * T2.
That way, we only need to pass down M to the vertex program, and compute V_view = V_local * M. In the end, a typical vertex shader multiplies the vertex position by a single matrix. All the work to compute that one matrix is how you move your object from local space to the clip space.
Ok... I glanced over a number of important details.
First, what I described so far only really covers the transformation we usually want to do up to the view space, not the clip space. However, the hardware expects the output position of the vertex shader to be represented in that special clip-space. It's hard to explain clip-space coordinates without significant math, so I will leave that out, but the important bit is that the transformation that brings the vertices to that clip-space can usually be expressed as the same type of matrix multiplication. This is what the old gluPerspective, glFrustum and glOrtho compute.
Second, this is what you apply to vertex positions. The math to transform normals is somewhat different. That's because you want the normal to stay perpendicular to the surface after transformation (for reference, it requires a multiplication by the inverse-transpose of the model-view in the general case, but that can be simplified in many cases)
Third, you never send 4-D coordinates to the vertex shader. In general you pass 3-D ones. OpenGL will transform those 3-D coordinates (or 2-D, btw) to 4-D ones so that the vertex shader does not have to add the extra coordinate. it expands each vertex to add the 1 as the w coordinate.
So... to put all that back together, for each object, you need to compute those magic M matrices based on all the transforms that you want to apply to the object. Inside the shader, you then have to multiply each vertex position by that matrix and pass that to the vertex shader Position output. Typical code is more or less (this is using old nomenclature):
mat4 MVP;
gl_Position=MVP * gl_Vertex;
* the actual matrices can be found on the web, notably on the man pages for each of those functions: rotate, translate, scale, perspective, ortho
Those functions are apparently deprecated, but are technically still perfectly functional and indeed will compile. So you can certainly still use the translate3f(...) etc functions.
HOWEVER, this tutorial has a good explanation of how the new shaders and so on work, AND for multiple objects in space.
You can create x arrays of vertexes, and bind them into x VAO objects, and you render the scene from there with shaders etc...meh, it's easier for you to just read it - it is a really good read to grasp the new concepts.
Also, the OpenGL 'Red Book' as it is called has a new release - The Official Guide to Learning OpenGL, Versions 3.0 and 3.1. It includes 'Discussion of OpenGL’s deprecation mechanism and how to verify your programs for future versions of OpenGL'.
I hope that's of some assistance!