Better way to convert YUV to RGB in Unity3D with native plugins - c++

I have a Unity3D application that plays videos on the UI.
I need a faster way to convert a YUV video buffer to an RGB buffer.
My setup is this:
- Unity3D with a UI image that renders the video
- an external GStreamer process which actually plays the video
- a native plugin, called from Unity3D, that converts the YUV video buffer to an RGBA one
Here is the part of my C++ plugin that does the YUV->RGBA conversion:
unsigned char *rgba = (unsigned char *)obj->g_RgbaBuff;
unsigned char *yuv  = (unsigned char *)obj->g_Buffer;
size_t i  = 0; // read index into the YUY2 buffer (4 bytes per macropixel)
size_t ta = 0; // write index into the RGBA buffer (4 bytes per pixel)
while (i < obj->g_BufferLength)
{
    // one YUY2 macropixel (Y0 U Y1 V) expands to two RGBA pixels
    int ty  = (int)yuv[i];
    int tu  = (int)yuv[i + 1];
    int tY2 = (int)yuv[i + 2];
    int tv  = (int)yuv[i + 3];
    int tp1 = (int)(1.164f * (ty - 16));
    int tr = Clamp((int)(tp1 + 1.596f * (tv - 128)));
    int tg = Clamp((int)(tp1 - 0.813f * (tv - 128) - 0.391f * (tu - 128)));
    int tb = Clamp((int)(tp1 + 2.018f * (tu - 128)));
    rgba[ta]     = tb;
    rgba[ta + 1] = tg;
    rgba[ta + 2] = tr;
    rgba[ta + 3] = 255;
    ta += 4;
    int tp2 = (int)(1.164f * (tY2 - 16));
    int tr2 = Clamp((int)(tp2 + 1.596f * (tv - 128)));
    int tg2 = Clamp((int)(tp2 - 0.813f * (tv - 128) - 0.391f * (tu - 128)));
    int tb2 = Clamp((int)(tp2 + 2.018f * (tu - 128)));
    rgba[ta]     = tb2;
    rgba[ta + 1] = tg2;
    rgba[ta + 2] = tr2;
    rgba[ta + 3] = 255;
    ta += 4;
    i += 4;
}
This code gets called by Unity3D in a loop to continuously update the texture behind my image, and the video shows correctly.
The problem is that it's really slow: with just three 720p videos playing, my frame rate drops from 60 FPS to well below 30.
Is there a way to do this on the GPU, or a smarter way to approach it altogether?
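For reference, one hedged sketch of the GPU route (texture names, shader plumbing and the Unity integration are assumptions; only the conversion itself is shown): upload the raw YUY2 bytes as a w/2 x h GL_RGBA texture, so each texel carries one (Y0, U, Y1, V) macropixel, and let a fragment shader apply the same coefficients as the loop above. This moves the per-pixel math off the CPU entirely.
// GLSL fragment shader stored as a C++ string for the plugin to compile.
// Assumes the YUY2 bytes were uploaded as a w/2 x h GL_RGBA texture (one
// macropixel per texel) created with GL_NEAREST filtering, so macropixels
// are never blended together by the sampler.
static const char *kYuy2ToRgbFrag = R"glsl(
uniform sampler2D uYuy2Tex; // w/2 x h RGBA: (Y0, U, Y1, V) per texel
uniform float uWidth;       // full image width in pixels
varying vec2 vUv;
void main()
{
    vec4 mp = texture2D(uYuy2Tex, vUv);                      // one macropixel
    float even = step(mod(floor(vUv.x * uWidth), 2.0), 0.5); // 1.0 for even x
    float Y = mix(mp.b, mp.r, even);                         // Y0 even, Y1 odd
    float yb = 1.164 * (Y    - 16.0 / 255.0);
    float u  =          mp.g - 128.0 / 255.0;
    float v  =          mp.a - 128.0 / 255.0;
    gl_FragColor = vec4(yb + 1.596 * v,                      // same coefficients
                        yb - 0.813 * v - 0.391 * u,          // as the CPU loop
                        yb + 2.018 * u,
                        1.0);
}
)glsl";

// Per-frame upload of the packed YUY2 bytes: w/2 texels wide, 4 bytes each.
static void UploadYuy2(GLuint tex, int w, int h, const void *yuy2)
{
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w / 2, h,
                    GL_RGBA, GL_UNSIGNED_BYTE, yuy2);
}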
To copy the buffer into the texture I use the following native code; the update is done every frame via Unity's GL.IssuePluginEvent():
static void ModifyTexturePixels(void *textureHandle, int w, int h, void *rgbaBuff)
{
    glBindTexture(GL_TEXTURE_2D, (GLuint)(size_t)textureHandle);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, rgbaBuff);
}
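As an aside, if the texture upload itself turns out to be a bottleneck, a pixel unpack buffer (PBO) lets the driver perform the copy asynchronously instead of stalling inside glTexSubImage2D. A minimal sketch, assuming a context with PBO support (this is not part of the original plugin):
// Stream the RGBA buffer through a pixel unpack buffer so the driver can
// overlap the copy with rendering instead of stalling in glTexSubImage2D.
static GLuint g_pbo = 0;

static void UploadViaPbo(GLuint tex, int w, int h, const void *rgba)
{
    const size_t size = (size_t)w * h * 4;
    if (g_pbo == 0)
        glGenBuffers(1, &g_pbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, g_pbo);
    // Orphan the old storage and upload the new frame in one call.
    glBufferData(GL_PIXEL_UNPACK_BUFFER, size, rgba, GL_STREAM_DRAW);
    glBindTexture(GL_TEXTURE_2D, tex);
    // With a PBO bound, the last argument is an offset into the buffer.
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, (const void *)0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}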

Related

Image subtraction with CUDA and textures

My goal is to use C++ with CUDA to subtract a dark frame from a raw image, using textures for acceleration. The input images are cv::Mat with type CV_8UC4 (I use the pointer to the data of the cv::Mat). This is the kernel I came up with, but I have no idea how to actually subtract the textures from each other:
__global__ void DarkFrameSubtractionKernel(uchar4* outputImage, size_t pitchOutputImage,
cudaTextureObject_t inputImage, cudaTextureObject_t darkImage, int width, int height)
{
const int x = blockIdx.x * blockDim.x + threadIdx.x;
const int y = blockDim.y * blockIdx.y + threadIdx.y;
const float tx = (x + 0.5f);
const float ty = (y + 0.5f);
if (x >= width || y >= height) return;
uchar4 inputImageTemp = tex2D<uchar4>(inputImage, tx, ty);
uchar4 darkImageTemp = tex2D<uchar4>(darkImage, tx, ty);
outputImage[y * pitchOutputImage + x] = inputImageTemp - darkImageTemp; // this line will throw an error
}
This is the function that calls the kernel (you can see that I create the textures from unsigned char):
void subtractDarkImage(unsigned char* inputImage, size_t pitchInputImage, unsigned char* outputImage,
size_t pitchOutputImage, unsigned char* darkImage, size_t pitchDarkImage, int width, int height,
cudaStream_t stream)
{
cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypePitch2D;
resDesc.res.pitch2D.width = width;
resDesc.res.pitch2D.height = height;
resDesc.res.pitch2D.devPtr = inputImage;
resDesc.res.pitch2D.pitchInBytes = pitchInputImage;
resDesc.res.pitch2D.desc = cudaCreateChannelDesc(8, 8, 8, 8, cudaChannelFormatKindUnsigned);
cudaTextureDesc texDesc = {};
texDesc.readMode = cudaReadModeElementType;
texDesc.addressMode[0] = cudaAddressModeBorder;
texDesc.addressMode[1] = cudaAddressModeBorder;
cudaTextureObject_t imageInputTex, imageDarkTex;
CUDA_CHECK(cudaCreateTextureObject(&imageInputTex, &resDesc, &texDesc, 0));
resDesc.res.pitch2D.devPtr = darkImage;
resDesc.res.pitch2D.pitchInBytes = pitchDarkImage;
CUDA_CHECK(cudaCreateTextureObject(&imageDarkTex, &resDesc, &texDesc, 0));
dim3 block(32, 8);
dim3 grid = paddedGrid(block.x, block.y, width, height);
DarkFrameSubtractionKernel<<<grid, block, 0, stream>>>(reinterpret_cast<uchar4*>(outputImage),
pitchOutputImage / sizeof(uchar4), imageInputTex, imageDarkTex, width, height);
CUDA_CHECK(cudaDestroyTextureObject(imageInputTex));
CUDA_CHECK(cudaDestroyTextureObject(imageDarkTex));
}
The code does not compile, as I cannot subtract one uchar4 from another (in the kernel). Is there an easy way to do the subtraction here?
Help is very much appreciated.
Is there an easy way of subtraction here?
There are no arithmetic operators defined for CUDA built-in vector types. If you replace
outputImage[y * pitchOutputImage + x] = inputImageTemp - darkImageTemp;
with
uchar4 val;
val.x = inputImageTemp.x - darkImageTemp.x;
val.y = inputImageTemp.y - darkImageTemp.y;
val.z = inputImageTemp.z - darkImageTemp.z;
val.w = inputImageTemp.w - darkImageTemp.w;
outputImage[y * pitchOutputImage + x] = val;
things will work. If this offends you, I suggest writing a small library of helper functions to hide the mess.
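For example, such a helper could be a component-wise operator (a sketch; like the snippet above, plain unsigned subtraction wraps on underflow):
// Component-wise subtraction for uchar4, hiding the per-field mess.
// Underflow wraps, matching the answer above; clamp per component
// (e.g. a.x > b.x ? a.x - b.x : 0) if saturation is wanted instead.
__host__ __device__ inline uchar4 operator-(uchar4 a, uchar4 b)
{
    return make_uchar4(a.x - b.x, a.y - b.y, a.z - b.z, a.w - b.w);
}
With that in scope, the original inputImageTemp - darkImageTemp line compiles as written.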

FreeImage wrong image color

I am trying to extract frames from a stream that I create with GStreamer and save them with FreeImage or QImage (the latter just for testing).
GstMapInfo bufferInfo;
GstBuffer *sampleBuffer;
GstStructure *capsStruct;
GstSample *sample;
GstCaps *caps;
int width, height;
const int BitsPP = 32;
/* Retrieve the buffer */
g_signal_emit_by_name (sink, "pull-sample", &sample);
if (sample) {
sampleBuffer = gst_sample_get_buffer(sample);
gst_buffer_map(sampleBuffer,&bufferInfo,GST_MAP_READ);
if (!bufferInfo.data) {
g_printerr("Warning: could not map GStreamer buffer!\n");
throw;
}
caps = gst_sample_get_caps(sample);
capsStruct= gst_caps_get_structure(caps,0);
gst_structure_get_int(capsStruct,"width",&width);
gst_structure_get_int(capsStruct,"height",&height);
auto bitmap = FreeImage_Allocate(width, height, BitsPP,0,0,0);
memcpy( FreeImage_GetBits( bitmap ), bufferInfo.data, width * height * (BitsPP/8));
// int pitch = ((((BitsPP * width) + 31) / 32) * 4);
// auto bitmap = FreeImage_ConvertFromRawBits(bufferInfo.data,width,height,pitch,BitsPP,0, 0, 0);
FreeImage_FlipHorizontal(bitmap);
bitmap = FreeImage_RotateClassic(bitmap,180);
static int id = 0;
std::string name = "/home/stadmin/pic/sample" + std::to_string(id++) + ".png";
#ifdef FREE_SAVE
FreeImage_Save(FIF_PNG,bitmap,name.c_str());
#endif
#ifdef QT_SAVE
//Format_ARGB32
QImage image(bufferInfo.data,width,height,QImage::Format_ARGB32);
image.save(QString::fromStdString(name));
#endif
fibPipeline.push(bitmap);
gst_sample_unref(sample);
gst_buffer_unmap(sampleBuffer, &bufferInfo);
return GST_FLOW_OK;
The color output with FreeImage is totally wrong, and the same happens with Qt's Format_ARGB32 (greens show as blue, blues as orange, etc.), but when I test with Qt's Format_RGBA8888 I get correct output. I need to use FreeImage, and I'd like to learn how to correct this.
Since you say Qt succeeds with Format_RGBA8888, I can only guess: the GStreamer frame has its bytes in RGBA order, while FreeImage expects BGRA (on a little-endian machine), which is what the reordering below produces.
Quick fix:
//have a buffer the same length as the incoming bytes
size_t length = width * height * (BitsPP/8);
BYTE *bytes = (BYTE *)malloc(length);
//copy the incoming bytes into it, in the right order:
size_t index = 0;
while (index < length)
{
    bytes[index]     = bufferInfo.data[index + 2]; //B
    bytes[index + 1] = bufferInfo.data[index + 1]; //G
    bytes[index + 2] = bufferInfo.data[index];     //R
    bytes[index + 3] = bufferInfo.data[index + 3]; //A
    index += 4;
}
//fill the bitmap using the buffer
auto bitmap = FreeImage_Allocate(width, height, BitsPP,0,0,0);
memcpy( FreeImage_GetBits( bitmap ), bytes, length);
//don't forget to
free(bytes);
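If the extra allocation bothers you, the same reordering can be written straight into the FreeImage bitmap. A sketch under the same assumptions (32-bit packed input, so for 32 bpp the scanline pitch equals width * 4 and the bits are contiguous):
// Swizzle RGBA -> BGRA directly into the bitmap, skipping the temp buffer.
FIBITMAP *bitmap = FreeImage_Allocate(width, height, BitsPP, 0, 0, 0);
BYTE *dst = FreeImage_GetBits(bitmap);
const BYTE *src = bufferInfo.data;
const size_t length = (size_t)width * height * (BitsPP / 8);
for (size_t i = 0; i < length; i += 4) {
    dst[i]     = src[i + 2]; // B
    dst[i + 1] = src[i + 1]; // G
    dst[i + 2] = src[i];     // R
    dst[i + 3] = src[i + 3]; // A
}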

OpenCV: Random alpha channel artifacts when overlaying images with transparency in iOS

In my iOS project I am adding small PNG images with an alpha channel as overlays on a JPEG picture. The result on my device in DEBUG mode is as expected: the tears are drawn correctly.
When I run the same code in the Simulator, or when I archive and export the app in RELEASE mode, I get random artifacts in the alpha channel.
The underlying cv::Mat objects all contain header info and a valid data section. The error is reproducible even on a green background.
The behaviour seems to be totally random, as from time to time no artifacts are drawn (image 3: right tear, image 4: left tear).
Ideas, anybody?
const char *cpath1 = [@"" cStringUsingEncoding:NSUTF8StringEncoding]; // overlay image path: pass your image path (an NSString) inside @""
const char *cpath  = [@"" cStringUsingEncoding:NSUTF8StringEncoding]; // underlay image path
cv::Mat overlay = cv::imread(cpath1, -1); // -1 keeps the alpha channel (needed for .png)
cv::Mat underlay = cv::imread(cpath, -1);
// extract the alpha channels
cv::Mat overlayAlpha;
std::vector<Mat> channels1;
split(overlay, channels1);
channels1[3].copyTo(overlayAlpha);
cv::Mat underlayAlpha;
std::vector<Mat> channels2;
split(underlay, channels2);
channels2[3].copyTo(underlayAlpha);
overlayImage(&underlay, &overlay, cv::Point(10, 10));
// convert the final image to RGB order
cv::split(underlay, channels1);
std::swap(channels1[0], channels1[2]); // swap B and R channels
cv::merge(channels1, underlay);        // merge channels
MatToUIImage(underlay); // convert the final cv::Mat to a UIImage for display
The overlay function is shown below; it is referenced from http://answers.opencv.org/question/73016/how-to-overlay-an-png-image-with-alpha-channel-to-another-png/
void overlayImage(Mat* src, Mat* overlay, const cv::Point& location){
for (int y = max(location.y, 0); y < src->rows; ++y)
{
int fY = y - location.y;
if (fY >= overlay->rows)
break;
for (int x = max(location.x, 0); x < src->cols; ++x)
{
int fX = x - location.x;
if (fX >= overlay->cols)
break;
// overlay alpha for this pixel, scaled to [0, 1]
double opacity = ((double)overlay->data[fY * overlay->step + fX * overlay->channels() + 3]) / 255;
// classic alpha blend, per channel: dst = src * (1 - a) + overlay * a
for (int c = 0; opacity > 0 && c < src->channels(); ++c)
{
unsigned char overlayPx = overlay->data[fY * overlay->step + fX * overlay->channels() + c];
unsigned char srcPx = src->data[y * src->step + x * src->channels() + c];
src->data[y * src->step + src->channels() * x + c] = srcPx * (1. - opacity) + overlayPx * opacity;
}
}
}
}
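For comparison, here is a sketch of the same blend using OpenCV matrix operations instead of raw pointer arithmetic; the function and variable names are illustrative, not from the original code. It assumes a 4-channel BGRA overlay and a 3- or 4-channel destination:
#include <opencv2/opencv.hpp>

void blendOverlay(cv::Mat &dst, const cv::Mat &overlay, cv::Point location)
{
    // Clip the overlay rectangle against the destination bounds.
    cv::Rect roi = cv::Rect(location, overlay.size()) &
                   cv::Rect(0, 0, dst.cols, dst.rows);
    if (roi.width <= 0 || roi.height <= 0)
        return;

    cv::Mat dstRoi = dst(roi);
    cv::Mat ovrRoi = overlay(cv::Rect(roi.x - location.x, roi.y - location.y,
                                      roi.width, roi.height));

    std::vector<cv::Mat> ovrCh, dstCh;
    cv::split(ovrRoi, ovrCh); // B, G, R, A
    cv::split(dstRoi, dstCh);

    cv::Mat alpha, invAlpha;
    ovrCh[3].convertTo(alpha, CV_32F, 1.0 / 255.0); // alpha scaled to [0, 1]
    invAlpha = 1.0 - alpha;

    for (int c = 0; c < 3; ++c) {
        cv::Mat s, o;
        dstCh[c].convertTo(s, CV_32F);
        ovrCh[c].convertTo(o, CV_32F);
        cv::Mat blended = s.mul(invAlpha) + o.mul(alpha); // dst*(1-a) + ovr*a
        blended.convertTo(dstCh[c], CV_8U);
    }
    cv::merge(dstCh, dstRoi); // writes back into the ROI view of dst
}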

Does videoInput guarantee RGB camera input? (Transferring image from videoInput/dshow -> Java BufferedImage)

I am using videoInput to get a live stream from my webcam, but I've run into a problem: videoInput's documentation implies that I should always get BGR/RGB, yet the "verbose" output tells me the pixel format is YUY2.
***** VIDEOINPUT LIBRARY - 0.1995 - TFW07 *****
SETUP: Setting up device 0
SETUP: 1.3M WebCam
SETUP: Couldn't find preview pin using SmartTee
SETUP: Default Format is set to 640 by 480
SETUP: trying format RGB24 @ 640 by 480
SETUP: trying format RGB32 @ 640 by 480
SETUP: trying format RGB555 @ 640 by 480
SETUP: trying format RGB565 @ 640 by 480
SETUP: trying format YUY2 @ 640 by 480
SETUP: Capture callback set
SETUP: Device is setup and ready to capture.
My first thought was to try converting to RGB (assuming I was really getting YUY2 data), and I ended up with a highly distorted, mostly blue image.
Here is my code for converting YUY2 to BGR (note: this is part of a much larger program and this is borrowed code; I can get the URL at anyone's request):
#define CLAMP_MIN( in, min ) ((in) < (min))?(min):(in)
#define CLAMP_MAX( in, max ) ((in) > (max))?(max):(in)
#define FIXNUM 16
#define FIX(a, b) ((int)((a)*(1<<(b))))
#define UNFIX(a, b) ((a+(1<<(b-1)))>>(b))
#define ICCIRUV(x) (((x)<<8)/224)
#define ICCIRY(x) ((((x)-16)<<8)/219)
#define CLIP(t) CLAMP_MIN( CLAMP_MAX( (t), 255 ), 0 )
#define GET_R_FROM_YUV(y, u, v) UNFIX((FIX(1.0, FIXNUM)*(y) + FIX(1.402, FIXNUM)*(v)), FIXNUM)
#define GET_G_FROM_YUV(y, u, v) UNFIX((FIX(1.0, FIXNUM)*(y) + FIX(-0.344, FIXNUM)*(u) + FIX(-0.714, FIXNUM)*(v)), FIXNUM)
#define GET_B_FROM_YUV(y, u, v) UNFIX((FIX(1.0, FIXNUM)*(y) + FIX(1.772, FIXNUM)*(u)), FIXNUM)
bool yuy2_to_rgb24(int streamid) {
int i;
unsigned char y1, u, y2, v;
int Y1, Y2, U, V;
unsigned char r, g, b;
int size = stream[streamid]->config.g_h * (stream[streamid]->config.g_w / 2);
unsigned long srcIndex = 0;
unsigned long dstIndex = 0;
try {
for(i = 0 ; i < size ; i++) {
y1 = stream[streamid]->vi_buffer[srcIndex];
u = stream[streamid]->vi_buffer[srcIndex+ 1];
y2 = stream[streamid]->vi_buffer[srcIndex+ 2];
v = stream[streamid]->vi_buffer[srcIndex+ 3];
Y1 = ICCIRY(y1);
U = ICCIRUV(u - 128);
Y2 = ICCIRY(y2);
V = ICCIRUV(v - 128);
r = CLIP(GET_R_FROM_YUV(Y1, U, V));
//r = (unsigned char)CLIP( (1.164f * (float(Y1) - 16.0f)) + (1.596f * (float(V) - 128)) );
g = CLIP(GET_G_FROM_YUV(Y1, U, V));
//g = (unsigned char)CLIP( (1.164f * (float(Y1) - 16.0f)) - (0.813f * (float(V) - 128.0f)) - (0.391f * (float(U) - 128.0f)) );
b = CLIP(GET_B_FROM_YUV(Y1, U, V));
//b = (unsigned char)CLIP( (1.164f * (float(Y1) - 16.0f)) + (2.018f * (float(U) - 128.0f)) );
stream[streamid]->rgb_buffer[dstIndex] = b;
stream[streamid]->rgb_buffer[dstIndex + 1] = g;
stream[streamid]->rgb_buffer[dstIndex + 2] = r;
dstIndex += 3;
r = CLIP(GET_R_FROM_YUV(Y2, U, V));
//r = (unsigned char)CLIP( (1.164f * (float(Y2) - 16.0f)) + (1.596f * (float(V) - 128)) );
g = CLIP(GET_G_FROM_YUV(Y2, U, V));
//g = (unsigned char)CLIP( (1.164f * (float(Y2) - 16.0f)) - (0.813f * (float(V) - 128.0f)) - (0.391f * (float(U) - 128.0f)) );
b = CLIP(GET_B_FROM_YUV(Y2, U, V));
//b = (unsigned char)CLIP( (1.164f * (float(Y2) - 16.0f)) + (2.018f * (float(U) - 128.0f)) );
stream[streamid]->rgb_buffer[dstIndex] = b;
stream[streamid]->rgb_buffer[dstIndex + 1] = g;
stream[streamid]->rgb_buffer[dstIndex + 2] = r;
dstIndex += 3;
srcIndex += 4;
}
return true;
} catch(...) {
return false;
}
}
Since this wasn't working, I assume either a) my color space conversion function is wrong, or b) videoInput is lying to me.
I wanted to double-check that videoInput was indeed telling the truth, and it turns out there is no way to see the pixel format that videoInput::getPixels() actually returns, outside of the verbose text (unless I'm extremely crazy and just can't see it). This makes me suspect that videoInput does some sort of color space conversion behind the scenes so you always get a consistent image, regardless of the webcam. With this in mind, and following some of the documentation in videoInput.h:96, it appears that it always hands out RGB or BGR images.
The utility I'm using to display the image takes RGB images (a Java BufferedImage), so I figured I could feed it the raw data directly from videoInput and it should be fine.
Here is how I set up my image in Java:
BufferedImage buffer = new BufferedImage(directShow.device_stream_width(stream),directShow.device_stream_height(stream), BufferedImage.TYPE_INT_RGB );
int rgbdata[] = directShow.grab_frame_stream(stream);
if( rgbdata.length > 0 ) {
buffer.setRGB(
0, 0,
directShow.device_stream_width(stream),
directShow.device_stream_height(stream),
rgbdata,
0, directShow.device_stream_width(stream)
);
}
And here is how I send it to Java (C++/JNI):
JNIEXPORT jintArray JNICALL Java_directshowcamera_dsInterface_grab_1frame_1stream(JNIEnv *env, jobject obj, jint streamid)
{
//jclass bbclass = env->FindClass( "java/nio/IntBuffer" );
//jmethodID putMethod = env->GetMethodID(bbclass, "put", "(B)Ljava/nio/IntBuffer;");
int buffer_size;
jintArray ia;
jint *intbuffer = NULL;
unsigned char *buffer = NULL;
append_stream( streamid );
buffer_size = stream_device_rgb24_size(streamid);
ia = env->NewIntArray( buffer_size );
intbuffer = (jint *)calloc( buffer_size, sizeof(jint) );
buffer = stream_device_buffer_rgb( streamid );
if( buffer == NULL ) {
env->DeleteLocalRef( ia );
return env->NewIntArray( 0 );
}
for(int i=0; i < buffer_size; i++ ) {
intbuffer[i] = (jint)buffer[i];
}
env->SetIntArrayRegion( ia, 0, buffer_size, intbuffer );
free( intbuffer );
return ia;
}
This has been driving me absolutely nuts for the past two weeks, and I've tried variations of everything suggested to me, with absolutely no sane success.
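One thing worth noting about the Java side, offered as a hedged observation rather than a confirmed fix: BufferedImage.setRGB() expects each int in the array to be one packed 0xRRGGBB pixel, while the JNI loop above stores one color byte per jint. Assuming stream_device_buffer_rgb() returns tightly packed BGR24 (the layout yuy2_to_rgb24 above produces), the copy would need to pack three bytes per pixel, roughly like this:
// A sketch of the packing, replacing the allocation and copy loop above.
// Assumes buffer_size is the byte count of the packed BGR24 buffer, so the
// returned array holds buffer_size / 3 ints, one per pixel.
int pixel_count = buffer_size / 3;
ia = env->NewIntArray(pixel_count);
intbuffer = (jint *)calloc(pixel_count, sizeof(jint));
for (int i = 0, px = 0; px < pixel_count; i += 3, ++px) {
    jint b = buffer[i];     // yuy2_to_rgb24 above writes B, G, R
    jint g = buffer[i + 1];
    jint r = buffer[i + 2];
    intbuffer[px] = (r << 16) | (g << 8) | b; // 0x00RRGGBB, as setRGB expects
}
env->SetIntArrayRegion(ia, 0, pixel_count, intbuffer);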

Issue with writing YUV image frame in C/C++

I am trying to convert an RGB frame, taken from OpenGL with glReadPixels(), to a YUV frame and write the YUV frame to a file (.yuv). Later on I would like to write it to a named pipe as input for FFMPEG, but for now I just want to write it to a file and view the result in a YUV image viewer, so disregard the "writing to pipe" part.
After running my code, I encountered the following problems:
The number of frames shown in the YUV image viewer is always 1/3 of the number of frames I declared in my program. When I declare fps as 10, I can only view 3 frames; when I declare fps as 30, I can only view 10. However, when I view the file in a text editor, I can see the correct number of "FRAME" markers in the file.
This is the example output that I got: http://www.bobdanani.net/image.yuv
I cannot see a correct image, just distorted green, blue, yellow, and black pixels.
I read about the YUV format at http://wiki.multimedia.cx/index.php?title=YUV4MPEG2 and http://www.fourcc.org/fccyvrgb.php#mikes_answer and http://kylecordes.com/2007/pipe-ffmpeg
Here is what I have tried so far. I know this conversion approach is quite inefficient, and I can optimize it later; for now I just want to get this naive approach working and have the image shown properly.
int frameCounter = 1;
int windowWidth = 0, windowHeight = 0;
unsigned char *yuvBuffer;
unsigned long bufferLength = 0;
unsigned long frameLength = 0;
int fps = 10;
char filename[100]; // output file name, filled by snprintf() below
int seq_num = 0;    // output file sequence number
void display(void) {
/* clear the color buffers */
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
/* DRAW some OPENGL animation, i.e. cube, sphere, etc
.......
.......
*/
glutSwapBuffers();
if ((frameCounter % fps) == 1){
bufferLength = 0;
windowWidth = glutGet(GLUT_WINDOW_WIDTH);
windowHeight = glutGet (GLUT_WINDOW_HEIGHT);
frameLength = (long) (windowWidth * windowHeight * 1.5 * fps) + 100; // YUV 420 length (width*height*1.5) + header length
yuvBuffer = new unsigned char[frameLength];
write_yuv_frame_header();
}
write_yuv_frame();
frameCounter = (frameCounter % fps) + 1;
if ( (frameCounter % fps) == 1){
snprintf(filename, 100, "out/image-%d.yuv", seq_num);
ofstream out(filename, ios::out | ios::binary);
if(!out) {
cout << "Cannot open file.\n";
}
out.write (reinterpret_cast<char*> (yuvBuffer), bufferLength);
out.close();
bufferLength = 0;
delete[] yuvBuffer;
}
}
void write_yuv_frame_header (){
char *yuvHeader = new char[100];
sprintf (yuvHeader, "YUV4MPEG2 W%d H%d F%d:1 Ip A0:0 C420mpeg2 XYSCSS=420MPEG2\n", windowWidth, windowHeight, fps);
memcpy ((char*)yuvBuffer + bufferLength, yuvHeader, strlen(yuvHeader));
bufferLength += strlen (yuvHeader);
delete[] yuvHeader;
}
void write_yuv_frame() {
int width = glutGet(GLUT_WINDOW_WIDTH);
int height = glutGet(GLUT_WINDOW_HEIGHT);
memcpy ((void*) (yuvBuffer+bufferLength), (void*) "FRAME\n", 6);
bufferLength +=6;
long length = windowWidth * windowHeight;
long yuv420FrameLength = (float)length * 1.5;
long lengthRGB = length * 3;
unsigned char *rgb = (unsigned char *) malloc(lengthRGB * sizeof(unsigned char));
unsigned char *yuvdest = (unsigned char *) malloc(yuv420FrameLength * sizeof(unsigned char));
glReadPixels(0, 0, windowWidth, windowHeight, GL_RGB, GL_UNSIGNED_BYTE, rgb);
int r, g, b, y, u, v, ypos, upos, vpos;
for (int j = 0; j < windowHeight; ++j){
for (int i = 0; i < windowWidth; ++i){
r = (int)rgb[(j * windowWidth + i) * 3 + 0];
g = (int)rgb[(j * windowWidth + i) * 3 + 1];
b = (int)rgb[(j * windowWidth + i) * 3 + 2];
y = (int)(r * 0.257 + g * 0.504 + b * 0.098) + 16;
u = (int)(r * 0.439 + g * -0.368 + b * -0.071) + 128;
v = (int)(r * -0.148 + g * -0.291 + b * 0.439 + 128);
ypos = j * windowWidth + i;
upos = (j/2) * (windowWidth/2) + i/2 + length;
vpos = (j/2) * (windowWidth/2) + i/2 + length + length/4;
yuvdest[ypos] = y;
yuvdest[upos] = u;
yuvdest[vpos] = v;
}
}
memcpy ((void*) (yuvBuffer + bufferLength), (void*)yuvdest, yuv420FrameLength);
bufferLength += yuv420FrameLength;
free (yuvdest);
free (rgb);
}
This is just the very basic approach, and I can optimize the conversion algorithm later.
Can anyone tell me what is wrong with my approach? My guess is that one of the issues is the out.write() call, because I cast the unsigned char* data to char*, which might lose precision. But if I don't cast it I get a compile error. That still doesn't explain why the output frames are corrupted (or why they only account for 1/3 of the total number of frames).
It looks to me like you have too many bytes per frame for 4:2:0 data. According to the spec you linked, the number of bytes for a 200x200 pixel 4:2:0 frame should be 200 * 200 * 3 / 2 = 60,000, but you have ~90,000 bytes. Looking at your code, I don't see where you convert from 4:4:4 to 4:2:0. So you have two choices: either set the header to 4:4:4, or convert the YCbCr data to 4:2:0 before writing it out.
I compiled your code, and there is indeed a problem when computing the upos and vpos values.
For me this worked (RGB to YUV NV12):
vpos = length + (windowWidth * (j/2)) + (i/2)*2;
upos = vpos + 1;
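On the second of the two choices in the first answer (converting to real 4:2:0 before writing), a minimal sketch could look like the following: Y is computed for every pixel, while U and V are averaged over each 2x2 block so the chroma planes end up quarter size. The function name, the plane order (Y, then U, then V, matching the upos/vpos math in the question), and the even width/height assumption are all mine; the coefficients use the usual BT.601 studio-swing assignments (note the question's u and v rows appear swapped relative to these):
// RGB (packed, 3 bytes per pixel) -> planar I420 (YUV 4:2:0).
// Assumes w and h are even; the output buffer must hold w * h * 3 / 2 bytes.
void rgb_to_i420(const unsigned char *rgb, unsigned char *yuv, int w, int h)
{
    unsigned char *yp = yuv;                    // Y plane: w * h bytes
    unsigned char *up = yuv + w * h;            // U plane: (w/2) * (h/2) bytes
    unsigned char *vp = up + (w / 2) * (h / 2); // V plane: (w/2) * (h/2) bytes
    for (int j = 0; j < h; j += 2) {
        for (int i = 0; i < w; i += 2) {
            int us = 0, vs = 0;
            // Full-resolution luma; chroma accumulated over the 2x2 block.
            for (int dj = 0; dj < 2; ++dj) {
                for (int di = 0; di < 2; ++di) {
                    const unsigned char *p = rgb + ((j + dj) * w + (i + di)) * 3;
                    int r = p[0], g = p[1], b = p[2];
                    // Studio-swing BT.601: results stay within 16..240.
                    yp[(j + dj) * w + (i + di)] =
                        (unsigned char)(0.257 * r + 0.504 * g + 0.098 * b + 16);
                    us += (int)(-0.148 * r - 0.291 * g + 0.439 * b + 128);
                    vs += (int)( 0.439 * r - 0.368 * g - 0.071 * b + 128);
                }
            }
            up[(j / 2) * (w / 2) + i / 2] = (unsigned char)(us / 4);
            vp[(j / 2) * (w / 2) + i / 2] = (unsigned char)(vs / 4);
        }
    }
}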