Faster encoding of realtime 3D graphics with OpenGL and x264

I am working on a system that renders 3D graphics on a server and sends them to a client as compressed video as soon as each frame is rendered.
I already have the code working, but I feel it could be much faster (it is already a bottleneck in the system).
Here is what I am doing:
First I grab the framebuffer:
glReadBuffer( GL_FRONT );
glReadPixels( 0, 0, width, height, GL_RGB, GL_UNSIGNED_BYTE, buffer );
Then I flip the framebuffer, because there is a weird bug with sws_scale (which I am using for colorspace conversion) that flips the image vertically when I convert. I flip it in advance; nothing fancy.
// bytesPerPixel: size of one pixel in bytes (3 for GL_RGB with GL_UNSIGNED_BYTE)
void VerticalFlip(int width, int height, byte* pixelData, int bytesPerPixel)
{
    const int rowBytes = width * bytesPerPixel;
    byte* temp = new byte[rowBytes];
    // swap row y with its mirror row (height - 1 - y)
    for (int y = 0; y < height / 2; y++)
    {
        byte* top = &pixelData[y * rowBytes];
        byte* bottom = &pixelData[(height - 1 - y) * rowBytes];
        memcpy(temp, top, rowBytes);
        memcpy(top, bottom, rowBytes);
        memcpy(bottom, temp, rowBytes);
    }
    delete[] temp;
}
Then I convert it to YUV420p
convertCtx = sws_getContext(width, height, PIX_FMT_RGB24, width, height, PIX_FMT_YUV420P, SWS_FAST_BILINEAR, NULL, NULL, NULL);
uint8_t *src[3]= {buffer, NULL, NULL};
sws_scale(convertCtx, src, &srcstride, 0, height, pic_in.img.plane, pic_in.img.i_stride);
Then I pretty much just call the x264 encoder. I am already using the zerolatency preset.
int frame_size = x264_encoder_encode(_encoder, &nals, &i_nals, _inputPicture, &pic_out);
My guess is that there should be a faster way to capture the frame and convert it to YUV420p. It would be nice to do the conversion on the GPU and only then copy the result to system memory, and hopefully there is a way to do the color conversion without needing the flip.
If there is no better way, at least this question may help someone trying to do this, to do it the same way I did.

First, use asynchronous texture reads with PBOs. Here is an example. It speeds up the read by using two PBOs that work asynchronously, without stalling the pipeline the way glReadPixels does when used directly. In my app I got an 80% performance boost when I switched to PBOs.
Additionally, on some GPUs glGetTexImage() works faster than glReadPixels(), so try it out.
But if you really want to take the video encoding to the next level, you can do it via CUDA using the Nvidia Codec Library. I recently asked the same question, so this can be helpful.
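On the vertical flip in the question: glReadPixels returns rows starting from the bottom of the image, which is why the converted picture comes out upside down. Instead of swapping rows in place, a common trick is to point the consumer at the last row and give it a negative stride, so rows are read in reverse order. The sketch below shows that addressing with a plain row copy. `copyRowsStrided` and `flippedCopy` are hypothetical helpers, not part of swscale or x264, and tightly packed RGB24 rows are assumed (no GL_PACK_ALIGNMENT padding).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copies `rows` rows of `rowBytes` bytes each.
// Strides are signed, so a negative source stride walks the image bottom-up.
void copyRowsStrided(const std::uint8_t* src, std::ptrdiff_t srcStride,
                     std::uint8_t* dst, std::ptrdiff_t dstStride,
                     std::size_t rowBytes, int rows)
{
    assert(src != nullptr && dst != nullptr && rows >= 0);
    for (std::ptrdiff_t y = 0; y < rows; ++y) {
        std::memcpy(dst + y * dstStride, src + y * srcStride, rowBytes);
    }
}

// Returns a top-down copy of a bottom-up RGB24 image (tightly packed rows assumed).
std::vector<std::uint8_t> flippedCopy(const std::vector<std::uint8_t>& bottomUp,
                                      int width, int height)
{
    std::vector<std::uint8_t> topDown(bottomUp.size());
    if (width <= 0 || height <= 0) {
        return topDown;
    }
    const std::ptrdiff_t stride = static_cast<std::ptrdiff_t>(width) * 3;
    assert(bottomUp.size() == static_cast<std::size_t>(stride) * height);

    // Start at the last row and step backwards through the source.
    const std::uint8_t* lastRow = bottomUp.data() + (height - 1) * stride;
    copyRowsStrided(lastRow, -stride, topDown.data(), stride,
                    static_cast<std::size_t>(stride), height);
    return topDown;
}
```

The same pointer and stride values (source pointer at `buffer + (height - 1) * stride`, source stride `-stride`) are often passed straight to sws_scale so that the flip happens during the conversion; treat that as something to verify against the libswscale version you build with.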

Related

Direct write to D3D texture from kernel

I am playing around with the NVDEC H.264 decoder from the NVIDIA CUDA samples. One thing I've found is that once a frame is decoded, it is converted from NV12 to a BGRA buffer allocated on the CUDA side, and then this buffer is copied to a D3D BGRA texture.
I find this inefficient in terms of memory usage, and want to convert the NV12 frame directly into the D3D texture with this kernel:
void Nv12ToBgra32(uint8_t *dpNv12, int nNv12Pitch, uint8_t *dpBgra, int nBgraPitch, int nWidth, int nHeight, int iMatrix)
So I create a D3D texture (BGRA, D3D11_USAGE_DEFAULT, D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS, D3D11_CPU_ACCESS_WRITE, 1 mipmap),
then register it and write to it on the CUDA side:
//Register
ck(cuGraphicsD3D11RegisterResource(&cuTexResource, textureResource, CU_GRAPHICS_REGISTER_FLAGS_NONE));
...
//Write output:
CUarray retArray;
ck(cuGraphicsMapResources(1, &cuTexResource, 0));
ck(cuGraphicsSubResourceGetMappedArray(&retArray, cuTexResource, 0, 0));
/*
yuvFramePtr (NV12) is uint8_t* from decoded frame,
it's stored within CUDA memory I believe
*/
Nv12ToBgra32(yuvFramePtr, w, (uint8_t*)retArray, 4 * w, w, h);
ck(cuGraphicsUnmapResources(1, &cuTexResource, 0));
Once the kernel is called, I get a crash, maybe because I am misusing CUarray. Can anybody please clarify how to use the output of cuGraphicsSubResourceGetMappedArray to write texture memory from a CUDA kernel? (Since only raw memory writes are needed, there is no need to handle clamping, filtering, or value scaling.)
OK, for anyone struggling with the question "How to write to a D3D11 texture from a CUDA kernel", here is how:
Create the D3D texture with D3D11_BIND_UNORDERED_ACCESS.
Then register the resource:
//ID3D11Texture2D *textureResource from D3D texture
CUgraphicsResource cuTexResource;
ck(cuGraphicsD3D11RegisterResource(&cuTexResource, textureResource, CU_GRAPHICS_REGISTER_FLAGS_NONE));
//You can also add write-discard if texture will be fully written by kernel
ck(cuGraphicsResourceSetMapFlags(cuTexResource, CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD));
Once the texture is created and registered, we can use it as a write surface.
ck(cuGraphicsMapResources(1, &cuTexResource, 0));
//Get array for first mip-map
CUarray retArray;
ck(cuGraphicsSubResourceGetMappedArray(&retArray, cuTexResource, 0, 0));
//Create surface from texture
CUsurfObject surf;
CUDA_RESOURCE_DESC surfDesc{};
surfDesc.res.array.hArray = retArray;
surfDesc.resType = CU_RESOURCE_TYPE_ARRAY;
ck(cuSurfObjectCreate(&surf, &surfDesc));
/*
Kernel declaration is:
void Nv12ToBgra32Surf(uint8_t* dpNv12, int nNv12Pitch, cudaSurfaceObject_t surf, int nBgraPitch, int nWidth, int nHeight, int iMatrix)
Surface write:
surf2Dwrite<uint>(VALUE, surf, x * sizeof(uint), y);
For BGRA surface we are writing uint, X offset is in bytes,
so multiply it with byte-size of type.
Run kernel:
*/
Nv12ToBgra32Surf(yuvFramePtr, w, /*out*/surf, 4 * w, w, h);
ck(cuGraphicsUnmapResources(1, &cuTexResource, 0));
ck(cuSurfObjectDestroy(surf));

Draw part of an image with OpenGL glDrawPixels

I have a function to draw an image in an OpenGL context (used in this case to render to a texture). It works for the whole image, but it should also be able to render only a rectangular part. Rendering a part works if the part has the same width as the image. For parts that are narrower than the image data it fails.
Here is the function (reduced to only the part for small widths, no cleanup, etc.):
void drawImage(uint32 imageWidth, uint32 imageHeight, uint8* pData,
uint32 offX, uint32 partWidth) // (offX+partWidth<=imageWidth)
{
uint8* p(pData);
if (partWidth != imageWidth)
{
glPixelStorei(GL_PACK_ROW_LENGTH, imageWidth);
p = calcFrom(offX, pData); // point at pixel in row
}
glDrawPixels(partWidth, imageHeight, GL_BGRA, GL_UNSIGNED_BYTE, p);
}
As said: if (partWidth==imageWidth) the rendering works fine. For some combinations of partWidth and imageWidth it also works, but that seems to be a very special case, mainly with very small images and some special partWidths.
I found no examples for this, but from the docs I think it should be possible to do it somehow like that. Did I misunderstand the whole thing, or have I just overlooked a small pitfall?
Thanks,
Moritz
P.S: it's running on windows
[Edited:] P.P.S: by now I have tried to do this with a texture. If I replace glDrawPixels with glTexImage2D I have the same problem... (I could upload the whole image and render only part of it, but for small parts of big pictures that might not be the best way...)
AAArrrghh!!
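GL_UNPACK_ROW_LENGTH not GL_PACK_ROW_LENGTH!!!!
The GL_UNPACK_* parameters control how OpenGL reads pixel data from client memory (glDrawPixels, glTexImage2D); the GL_PACK_* parameters only affect data written back to client memory (glReadPixels).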
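For reference, the unpack row length tells OpenGL how many pixels to advance in client memory from one row to the next, while the pointer passed to the draw call selects the first pixel of the part. The sketch below reproduces that addressing on the CPU, without any OpenGL calls. `extractPart` is a hypothetical helper, and 4 bytes per pixel are assumed (as for GL_BGRA with GL_UNSIGNED_BYTE).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kBytesPerPixel = 4; // assumption: BGRA, one byte per channel

// Hypothetical helper: copies the columns [offX, offX + partWidth) of every row
// into a tightly packed buffer, reading the source the way OpenGL would with
// the row length set to imageWidth.
std::vector<std::uint8_t> extractPart(const std::uint8_t* pData,
                                      std::size_t imageWidth, std::size_t imageHeight,
                                      std::size_t offX, std::size_t partWidth)
{
    assert(pData != nullptr);
    assert(offX + partWidth <= imageWidth);

    const std::size_t rowLength = imageWidth * kBytesPerPixel; // step between source rows
    const std::size_t partBytes = partWidth * kBytesPerPixel;  // bytes used from each row
    const std::uint8_t* first = pData + offX * kBytesPerPixel; // first pixel of the part

    std::vector<std::uint8_t> part(partBytes * imageHeight);
    if (partBytes == 0) {
        return part;
    }
    for (std::size_t y = 0; y < imageHeight; ++y) {
        std::memcpy(part.data() + y * partBytes, first + y * rowLength, partBytes);
    }
    return part;
}
```

If the row step were taken from the part width instead of the image width, every row after the first would start at the wrong pixel, which matches the symptom in the question: only parts as wide as the image come out right.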

What is the best way to fill AVFrame.data

I want to transfer OpenGL framebuffer data to AVCodec as fast as possible.
I've already converted RGB to YUV with a shader and read it back with glReadPixels.
I still need to fill the AVFrame data manually. Is there a better way?
AVFrame *frame;
// Y
frame->data[0][y*frame->linesize[0]+x] = data[i*3];
// U
frame->data[1][y*frame->linesize[1]+x] = data[i*3+1];
// V
frame->data[2][y*frame->linesize[2]+x] = data[i*3+2];
You can use sws_scale.
In fact, you don't need shaders for converting RGB->YUV. Believe me, it's not gonna have a very different performance.
// the destination format must match the format destinyPictureYUV is allocated with
swsContext = sws_getContext(WIDTH, HEIGHT, AV_PIX_FMT_RGBA, WIDTH, HEIGHT, codecContext->pix_fmt, SWS_BICUBIC, 0, 0, 0 );
sws_scale(swsContext, (const uint8_t * const *)sourcePictureRGB.data, sourcePictureRGB.linesize, 0, codecContext->height, destinyPictureYUV.data, destinyPictureYUV.linesize);
The data in destinyPictureYUV will be ready to go to the codec.
In this sample, destinyPictureYUV is the AVFrame you want to fill up. Try setting it up like this:
// frame must point to an allocated AVFrame before it is written to
// (avcodec_alloc_frame() in older FFmpeg releases)
AVFrame * frame = av_frame_alloc();
AVPicture destinyPictureYUV;
avpicture_alloc(&destinyPictureYUV, codecContext->pix_fmt, codecContext->width, codecContext->height);
// THIS is what you want probably
*reinterpret_cast<AVPicture *>(frame) = destinyPictureYUV;
With this setup you CAN ALSO fill up with the data you already converted to YUV in the GPU if you desire... you can choose the way you want.
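If the YUV conversion stays on the GPU, the per-pixel writes from the question can at least be kept in one tight loop that respects each plane's line size, since rows in an AVFrame may be padded. A minimal sketch, assuming the readback buffer holds 3 bytes per pixel in Y, U, V order as in the question and that all three planes are full resolution. `PlanarView` is a hypothetical stand-in for the `data` and `linesize` members of AVFrame so that the example compiles without FFmpeg.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for AVFrame::data / AVFrame::linesize.
struct PlanarView {
    std::uint8_t* data[3]; // Y, U, V plane pointers
    int linesize[3];       // bytes per row of each plane, may exceed width
};

// Splits a tightly packed Y,U,V,Y,U,V,... buffer into three strided planes.
void unpackYuv444(const std::uint8_t* packed, int width, int height,
                  const PlanarView& frame)
{
    assert(packed != nullptr && width >= 0 && height >= 0);
    for (int p = 0; p < 3; ++p) {
        assert(frame.data[p] != nullptr && frame.linesize[p] >= width);
    }
    for (int y = 0; y < height; ++y) {
        const std::size_t row = static_cast<std::size_t>(y);
        const std::uint8_t* src = packed + row * width * 3;
        std::uint8_t* dstY = frame.data[0] + row * frame.linesize[0];
        std::uint8_t* dstU = frame.data[1] + row * frame.linesize[1];
        std::uint8_t* dstV = frame.data[2] + row * frame.linesize[2];
        for (int x = 0; x < width; ++x) {
            dstY[x] = src[3 * x];
            dstU[x] = src[3 * x + 1];
            dstV[x] = src[3 * x + 2];
        }
    }
}
```

With a real AVFrame the same loop would read the plane pointers and line sizes from `frame->data` and `frame->linesize`. For a subsampled format such as YUV420P the U and V planes are smaller, so this exact loop would not apply.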

Converting QImage to YUV420P pixel format

Has anybody solved this problem before? I need a simple and fast method to convert a QImage::bits() buffer from RGB32 to the YUV420P pixel format. Can you help me?
libswscale, part of the ffmpeg project, has optimized routines to perform colorspace conversions, scaling, and filtering. If you really want speed, I would suggest using it unless you cannot add the extra dependency. I haven't actually tested this code, but here is the general idea:
QImage img = ... //your image in RGB32
//allocate output buffer. use av_malloc to align memory. YUV420P
//needs 1.5 times the number of pixels (Cb and Cr only use 0.25
//bytes per pixel on average)
uint8_t* out_buffer = (uint8_t*)av_malloc((int)ceil(img.height() * img.width() * 1.5));
//allocate ffmpeg frame structures
AVFrame* inpic = avcodec_alloc_frame();
AVFrame* outpic = avcodec_alloc_frame();
//avpicture_fill sets all of the data pointers in the AVFrame structures
//to the right places in the data buffers. It does not copy the data so
//the QImage and out_buffer still need to live after calling these.
//QImage::Format_RGB32 stores each pixel as a native-endian 0xffRRGGBB
//value, which corresponds to AV_PIX_FMT_RGB32 (not AV_PIX_FMT_ARGB on
//little-endian machines)
avpicture_fill((AVPicture*)inpic,
               img.bits(),
               AV_PIX_FMT_RGB32,
               img.width(),
               img.height());
avpicture_fill((AVPicture*)outpic,
               out_buffer,
               AV_PIX_FMT_YUV420P,
               img.width(),
               img.height());
//create the conversion context. you only need to do this once if
//you are going to do the same conversion multiple times.
SwsContext* ctx = sws_getContext(img.width(),
                                 img.height(),
                                 AV_PIX_FMT_RGB32,
                                 img.width(),
                                 img.height(),
                                 AV_PIX_FMT_YUV420P,
                                 SWS_BICUBIC,
                                 NULL, NULL, NULL);
//perform the conversion
sws_scale(ctx,
inpic->data,
inpic->linesize,
0,
img.height(),
outpic->data,
outpic->linesize);
//free memory
sws_freeContext(ctx);
av_free(inpic);
av_free(outpic);
//...
//free output buffer when done with it
av_free(out_buffer);
Like I said, I haven't tested this code so it may require some tweaks to get it working.
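The 1.5 factor in the buffer-size comment above comes from the plane layout of YUV420P: one full-resolution Y plane plus Cb and Cr planes subsampled by two in each direction. Here is a small sketch of that arithmetic, assuming tightly packed planes with no row padding. `yuv420pLayout` is a hypothetical helper, not an FFmpeg function.

```cpp
#include <cassert>
#include <cstddef>

struct Yuv420pLayout {
    std::size_t lumaBytes;   // Y plane
    std::size_t chromaBytes; // each of the Cb and Cr planes
    std::size_t totalBytes;  // Y + Cb + Cr
};

// Hypothetical helper: byte sizes of the three planes of a YUV420P image.
Yuv420pLayout yuv420pLayout(int width, int height)
{
    assert(width > 0 && height > 0);
    const std::size_t w = static_cast<std::size_t>(width);
    const std::size_t h = static_cast<std::size_t>(height);
    // Chroma dimensions round up when the luma dimensions are odd.
    const std::size_t chromaW = (w + 1) / 2;
    const std::size_t chromaH = (h + 1) / 2;

    Yuv420pLayout layout;
    layout.lumaBytes = w * h;
    layout.chromaBytes = chromaW * chromaH;
    layout.totalBytes = layout.lumaBytes + 2 * layout.chromaBytes;
    return layout;
}
```

For even dimensions this gives exactly width * height * 1.5 bytes. When a dimension is odd the chroma planes round up, so width * height * 1.5 slightly underestimates the required size.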

Stitching multiple textures/frames together in OpenGL using a Kinect

I came across the following situation:
I have a Kinect camera and I keep taking frames (but they are stored only when the user presses a key).
I am using the freenect library to retrieve the depth and the color of each frame (I am not interested in skeleton tracking or anything like that).
For a single frame I am using the glpclview example that comes with the freenect library.
After retrieving the spatial data from the Kinect sensor, the glpclview example draws the current frame like this:
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_SHORT, 0, xyz);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glTexCoordPointer(3, GL_SHORT, 0, xyz);
glEnable(GL_TEXTURE_2D);
glBindTexture(GL_TEXTURE_2D, gl_rgb_tex);
glTexImage2D(GL_TEXTURE_2D, 0, 3, 640, 480, 0, GL_RGB, GL_UNSIGNED_BYTE, rgb);
glPointSize(2.0f);
glDrawElements(GL_POINTS, 640*480, GL_UNSIGNED_INT, indices);
where
static unsigned int indices[480][640];
static short xyz[480][640][3];
char *rgb = 0;
short *depth = 0;
where:
rgb is the color information for the current frame
depth is the depth information for the current frame
xyz is constructed as:
xyz[i][j][0] = j
xyz[i][j][1] = i
xyz[i][j][2] = depth[i*640+j]
indices is (I guess) just an array that keeps track of the rgb/depth data, and is constructed as:
indices[i][j] = i*640+j
So far, so good, but now I need to render more than just one frame (some of them rotated and translated by certain angles/offsets). How can I do this?
I've tried to increase the size of the arrays and keep reallocating memory for each new frame, but how can I render them?
Should I change this line to something else?
glTexImage2D(GL_TEXTURE_2D, 0, 3, 640, 480, 0, GL_RGB, GL_UNSIGNED_BYTE, rgb)
If so, to what values should I change 640 and 480, since xyz and rgb are now contiguous buffers of 640x480x(number of frames)?
To give a better idea, I am trying to get something similar to this in the end (except the robot :D ).
If someone has a better idea or any hint on how I should approach this problem, please let me know.
It isn't as simple as allocating a bigger array.
If you want to stitch together multiple point clouds to make a bigger map, you should look into SLAM algorithms (that is what they are running in the video you link to). You can find many implementations at http://openslam.org. You might also look into an ICP algorithm (Iterative Closest Point) and KinectFusion from Microsoft (and the open source KinFu implementation from PCL).
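Once the pose of each frame is known (for example, estimated by ICP), rendering several frames comes down to transforming every frame's points into one common coordinate system and appending them to a single vertex array. Below is a minimal sketch of that merging step only, assuming a row-major 3x3 rotation plus a translation per frame. `Point`, `Pose` and `appendFrame` are hypothetical names, and the pose estimation itself is not shown.

```cpp
#include <array>
#include <cassert>
#include <vector>

struct Point {
    float x, y, z;
};

// Hypothetical rigid transform of one frame into the common coordinate system.
struct Pose {
    std::array<float, 9> r; // row-major 3x3 rotation matrix
    std::array<float, 3> t; // translation
};

// Transforms every point of `frame` by `pose` and appends it to `cloud`.
void appendFrame(std::vector<Point>& cloud, const std::vector<Point>& frame,
                 const Pose& pose)
{
    // Appending a cloud to itself would invalidate the loop below.
    assert(&cloud != &frame);
    cloud.reserve(cloud.size() + frame.size());
    for (const Point& p : frame) {
        cloud.push_back(Point{
            pose.r[0] * p.x + pose.r[1] * p.y + pose.r[2] * p.z + pose.t[0],
            pose.r[3] * p.x + pose.r[4] * p.y + pose.r[5] * p.z + pose.t[1],
            pose.r[6] * p.x + pose.r[7] * p.y + pose.r[8] * p.z + pose.t[2]});
    }
}
```

The merged array can then be drawn with a single vertex pointer, with the colors of each frame appended in the same order. An alternative is to keep each frame's arrays separate and apply its pose as a model-view transform before drawing that frame.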