I need to decode video, but my video player only supports the RGB8 pixel format. So I'm looking into how to do pixel format conversion on the GPU, preferably as part of the decoding process, but if that's not possible, after it.
I've found How to set decode pixel format in libavcodec? which explains how to decode video with ffmpeg to a specific pixel format, as long as it's supported by the codec.
Basically, get_format() is a callback which chooses, from the list of pixel formats supported by the codec, the pixel format for the decoded video. My questions are:
Is this list of supported codec output formats the same for all computers? For example, if my codec is H264, will it always give me the same list on all computers? (assuming the same ffmpeg version on all computers)
If I choose any of these supported pixel formats, will the pixel format conversion always happen in the GPU?
If some of the pixel format conversions won't happen on the GPU, then my question is: does the sws_scale() function convert on the GPU or on the CPU?
It depends. First, H264 is just a codec standard, while libx264 and openh264 are implementations of that standard, so you can expect each implementation to support different formats. But let's assume (as you did in your question) that you are using the same implementation on different machines. Even then there are cases where different machines support different formats. Take H264_AMF, for example: you need an AMD graphics card to use that codec, and the supported formats will depend on your graphics card as well.
Decoding will generally happen on your CPU unless you explicitly specify a hardware decoder. See this example for Hardware decoding: https://github.com/FFmpeg/FFmpeg/blob/release/4.1/doc/examples/hw_decode.c
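For reference, here is a minimal sketch of that setup, loosely following hw_decode.c. The CUDA device type and the helper names are just illustrative; pick the hardware type that matches your machine:

```cpp
extern "C" {
#include <libavcodec/avcodec.h>
#include <libavutil/hwcontext.h>
}

// get_format(): pick the hardware pixel format if the decoder offers it,
// otherwise fall back to the first (software) format in the list.
static enum AVPixelFormat pick_hw_format(AVCodecContext *, const enum AVPixelFormat *fmts)
{
    for (const enum AVPixelFormat *p = fmts; *p != AV_PIX_FMT_NONE; ++p)
        if (*p == AV_PIX_FMT_CUDA)
            return *p;
    return fmts[0];
}

AVCodecContext *open_h264_hw_decoder()
{
    const AVCodec *codec = avcodec_find_decoder(AV_CODEC_ID_H264);
    AVCodecContext *ctx  = avcodec_alloc_context3(codec);

    AVBufferRef *hw_device = nullptr;
    if (av_hwdevice_ctx_create(&hw_device, AV_HWDEVICE_TYPE_CUDA,
                               nullptr, nullptr, 0) < 0)
        return nullptr;                   // no usable hardware device on this machine

    ctx->hw_device_ctx = av_buffer_ref(hw_device);
    ctx->get_format    = pick_hw_format;  // the decoder calls this with its supported formats

    if (avcodec_open2(ctx, codec, nullptr) < 0)
        return nullptr;
    return ctx;
}
```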
When using hardware decoding you are heavily relying on your machine, and each hardware decoder will output its own (vendor-specific) frame format, e.g. NV12 for an Nvidia graphics card. Now comes the tricky part: the decoded frames will remain in your GPU memory, which means you might be able to reuse the AVFrame buffer to do the pixel conversion using OpenCL/GL. But achieving GPU zero-copy when working with different frameworks is not that easy, and I don't have enough knowledge to help you there. So what I would do is download the decoded frame from the GPU via av_hwframe_transfer_data, like in the example.
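A rough sketch of that download step, assuming a decoded frame that still lives in GPU memory (error handling trimmed):

```cpp
extern "C" {
#include <libavutil/frame.h>
#include <libavutil/hwcontext.h>
}

// Copy a hardware frame into a plain system-memory frame (often NV12).
// The caller owns the returned frame and must av_frame_free() it.
AVFrame *download_frame(AVFrame *hw_frame)
{
    AVFrame *sw_frame = av_frame_alloc();
    if (av_hwframe_transfer_data(sw_frame, hw_frame, 0) < 0) {
        av_frame_free(&sw_frame);
        return nullptr;
    }
    return sw_frame;
}
```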
From this point on it doesn't make much of a difference if you used hardware or software decoding.
To my knowledge sws_scale doesn't use hardware acceleration, since it doesn't accept "hwframes". If you want to do the color conversion on the GPU, you might want to take a look at OpenCV: you can use a cv::cuda::GpuMat there, upload your frame, call cvtColor, and download it again.
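A minimal sketch of that GpuMat round trip. This assumes an OpenCV build with the CUDA modules, and note that cv::cuda::cvtColor implements only a subset of the conversion codes, so check that the one you actually need (e.g. an NV12 code) is available in your version:

```cpp
#include <opencv2/core.hpp>
#include <opencv2/cudaimgproc.hpp>

// Upload, convert on the GPU, download. The two copies are usually the
// expensive part, which is why this only pays off for bigger pipelines.
cv::Mat convert_on_gpu(const cv::Mat &src_bgr)
{
    cv::cuda::GpuMat d_src, d_dst;
    d_src.upload(src_bgr);                            // host -> device
    cv::cuda::cvtColor(d_src, d_dst, cv::COLOR_BGR2RGB);
    cv::Mat dst;
    d_dst.download(dst);                              // device -> host
    return dst;
}
```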
Some general remarks:
Almost any image operation (scaling etc.) is faster on your GPU, but uploading and downloading the data can take ages. For single operations it's often not worth using your GPU.
In your case, I would try to work with CPU decoding and CPU color conversion first. Just make sure to use well-threaded and vectorized implementations like OpenCV or Intel IPP. If you still lack performance, then you can think about hardware acceleration.
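As a sketch of that CPU path with libswscale (RGB24 is used as the target here; swap in AV_PIX_FMT_RGB8 if your player really needs that packed 3:3:2 format):

```cpp
extern "C" {
#include <libswscale/swscale.h>
#include <libavutil/frame.h>
}

// Convert whatever the decoder produced into an RGB24 frame on the CPU.
AVFrame *to_rgb24(const AVFrame *src)
{
    AVFrame *dst = av_frame_alloc();
    dst->format = AV_PIX_FMT_RGB24;
    dst->width  = src->width;
    dst->height = src->height;
    av_frame_get_buffer(dst, 0);

    SwsContext *sws = sws_getContext(src->width, src->height,
                                     (AVPixelFormat)src->format,
                                     dst->width, dst->height, AV_PIX_FMT_RGB24,
                                     SWS_BILINEAR, nullptr, nullptr, nullptr);
    sws_scale(sws, src->data, src->linesize, 0, src->height,
              dst->data, dst->linesize);
    sws_freeContext(sws);
    return dst;
}
```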
Related
I had to create a system that can process images in real time. I have implemented in C++ a pixel format conversion system that can also apply some simple transformations (currently rotation and mirroring).
The input and output frames of the system are in one of the following formats:
RGB (24, 32)
YUYV420, YUYV 422
JPEG
Raw greyscale
For instance, one operation can be:
YUYV422 -> rotation 90 -> flip Horiz -> RGB24
Greyscale -> rotation 270 -> flip Vert -> YUYV420
The goal of the system is to offer the best performance for rotation/mirroring and pixel format conversion. My current implementation relies on OpenCV, but I suffer from performance issues when processing data above 2K resolutions.
The current implementation uses cv::Mat and cv::transpose/cv::flip/cv::cvtColor, and I have optimized the system to remove transitional buffers and copies as much as possible.
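For illustration (this is not the author's code, just the OpenCV primitives named above), a 90° clockwise rotation expressed as transpose + flip plus a packed YUYV 4:2:2 to RGB conversion might look like this; the conversion is done first because rotating packed 4:2:2 data directly is awkward, since horizontally adjacent pixels share chroma samples:

```cpp
#include <opencv2/imgproc.hpp>

void yuyv_to_rgb_rot90(const cv::Mat &yuyv, cv::Mat &rgb)
{
    cv::Mat tmp;
    cv::cvtColor(yuyv, tmp, cv::COLOR_YUV2RGB_YUY2);  // packed YUYV 4:2:2 -> RGB24
    cv::transpose(tmp, rgb);                          // W x H -> H x W
    cv::flip(rgb, rgb, 1);                            // flip around the y-axis: net 90° clockwise
}
```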
Not wanting to reinvent the wheel, I know that using swscale and some filters from FFMpeg it is possible to achieve the same result. My questions are:
The FFMpeg system is rather generic; do you think I might suffer from footprint/performance caveats with this solution?
Format conversion seems fairly optimized in OpenCV, but I have no idea about the FFMpeg implementation... (note: I'm on an x86_64 Intel platform with SSE)
Do you know of any library that can handle this kind of simple transformation in real time?
Thank you
The OpenCV implementation is optimised for your configuration; don't expect improvements from ffmpeg. OpenCV recently switched to libjpeg-turbo with SSE optimizations, which may improve JPEG conversions.
I have written a C/C++ implementation of what I term a "compositor" (I come from a video background) to composite/overlay video/graphics on the top of a video source. My current compositor implementation is rather naive and there is room for CPU optimization improvements (ex: SIMD, threading, etc).
I've created a high-level diagram of what I am currently doing:
The diagram is self explanatory. Nonetheless, I'll elaborate on some of the constraints:
The main video always comes served in an 8-bit YUV 4:2:2 packed format
The secondary video (optional) will come served in either an 8-bit YUV 4:2:2 or YUVA 4:2:2:4 packed format.
The output from the overlay must come out in an 8-bit YUV 4:2:2 packed format
Some other bits of information:
The number of graphics inputs will vary; it may (or may not) be a constant value.
The colour format of the Graphics can be pinned to either ARGB or YUVA format (ie. I can provide it as you see fit). At the moment, I pin it to YUVA to keep a consistent colour format.
The potential of using OpenGL and accompanying shaders is rather appealing:
No need to reinvent the wheel (in terms of actually performing the composition)
The possibility of using GPU where available.
My concern with using OpenGL is performance. Looking around on the web, it is my understanding that a YUV surface would be converted to RGB internally; I would like to minimize the number of colour format conversions and ensure optimal performance. Without prior OpenGL experience, I hope someone can shed some light and suggest if I'm about to venture down the wrong path.
Perhaps my concern relating to performance is less of an issue when using a dedicated GPU? Do I need to consider separate code paths:
Hardware with GPU(s)
Hardware with only CPU(s)?
Additionally, am I going to struggle when I need to process 10-bit YUV?
You should be able to treat YUV as independent channels throughout. OpenGL shaders will be calling them r, g, and b, but it's just data that can be treated as whatever you want.
Most GPUs will support 10 bits per channel (+ 2 alpha bits). Many will support 16 bits per channel for all 4 channels, but I'm a little rusty here, so I don't know how common support for this is. I'm not sure about the 4:2:2 data, but you can always treat it as 3 separate surfaces.
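To make the "just data" point concrete, here is a hedged sketch of a fragment shader (shown as a C++ string constant; the uniform names and the blend are illustrative, not a drop-in version of the compositor described above) that samples Y, U and V as three single-channel textures and blends a YUVA overlay on top without ever converting to RGB:

```cpp
// Sketch only: the render target is interpreted as YUV, not RGB.
static const char *kCompositeFrag = R"(
#version 330 core
in vec2 uv;
out vec4 outColor;

uniform sampler2D texY;        // full-resolution luma plane
uniform sampler2D texU;        // half-width chroma plane (4:2:2)
uniform sampler2D texV;
uniform sampler2D texOverlay;  // YUVA overlay; .rgb really holds Y, U, V

void main()
{
    vec3 yuv = vec3(texture(texY, uv).r,
                    texture(texU, uv).r,
                    texture(texV, uv).r);
    vec4 ovl = texture(texOverlay, uv);
    outColor = vec4(mix(yuv, ovl.rgb, ovl.a), 1.0);   // plain alpha blend in YUV space
}
)";
```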
The number of graphics inputs will vary; it may (or may not) be a constant value.
This is something I'm a little less sure about. Shaders prefer things like this to be predictable. If your implementation allows you to add each input iteratively, then you should be fine.
As an alternative suggestion, have you looked into OpenCL?
I am developing mobile games on Windows. Our image resources are in the PVR-TC 4 format. When we run our game in the simulator, images are decoded by the CPU, which is really slow, as our PC graphics card doesn't support decoding them on the GPU. Is it possible to make PC OpenGL support PVR-TC or ETC hardware decoding?
You cannot force an implementation to implement a particular extension or image format.
Your best bet is to convert the images yourself offline. That is, instead of loading images of a format your hardware can't handle, load images of the format that it can.
After all, it's not like the images are originally in PVRTC format, right? They were originally authored in a regular format like PNG or whatever, then converted to PVRTC. So just add another conversion for S3TC or whatever format desktop hardware actually supports.
My cam gives me jpeg with chroma sub-sampling 4:2:2, but I need 4:2:0.
Can I change MJPEG default chroma sub-sampling with v4l2?
v4l2 itself provides a very thin layer around the actual video data that is transferred: it will simply give you the formats that the camera (the hardware!) delivers.
So if your hardware offers two distinct formats, then there is no way that v4l2 will offer you anything else.
You might want to check out the libv4l2 library, which does some basic colorspace conversion: in general it has conversions from most exotic hardware formats to a handful of "standard" formats, so your application does not need to support every format a hardware manufacturer can come up with. However, it is not very likely that these standard formats include a very specific (compressed) format like the one you need.
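If you do want to try the libv4l2 route, a minimal sketch looks like the following; the resolution and device path are placeholders, and whether the 4:2:0 emulation is actually available depends on your libv4l2 version, so check the format the call returns:

```cpp
#include <fcntl.h>
#include <libv4l2.h>
#include <linux/videodev2.h>
#include <cstring>

// Open a capture device and ask for YUV 4:2:0 through the v4l2_* wrappers,
// so libv4l2 can convert in software if the hardware cannot deliver it.
int open_yuv420(const char *dev /* e.g. "/dev/video0" */)
{
    int fd = v4l2_open(dev, O_RDWR);
    if (fd < 0) return -1;

    v4l2_format fmt;
    std::memset(&fmt, 0, sizeof fmt);
    fmt.type                = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    fmt.fmt.pix.width       = 1280;
    fmt.fmt.pix.height      = 720;
    fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_YUV420;

    if (v4l2_ioctl(fd, VIDIOC_S_FMT, &fmt) < 0 ||
        fmt.fmt.pix.pixelformat != V4L2_PIX_FMT_YUV420) {
        v4l2_close(fd);     // conversion to 4:2:0 is not offered
        return -1;
    }
    return fd;
}
```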
Decoding a png image of size 1080x1920 takes over 30ms and I'm looking to do it faster.
In Android, BitmapFactory has a method where you can pass a sample size to be used when decoding. This causes the returned decoded image to be smaller than the actual source, which in turn makes the decoding process a lot faster, at the cost of a lower quality image.
I want to do something similar in c++ using some png decoding library such as libpng but for some reason I can't find any details about decoding at a lower quality.
Any pointers or ideas to improve decoding time would be appreciated!
Asking for a lower resolution image during decoding would have zero influence on CPU work: a PNG stream is basically a compressed zlib stream which must be fully decompressed, and inside that there is PNG-specific unfiltering to be done, which, again, requires all the neighbouring pixels. Of course, subsampling could lead to less memory usage (which in itself can result in less decoding time); for this you'd need to decode the PNG progressively, so that the subsampling is done line by line. You can do that with (my) Java library PNGJ; it's optimized for that usage pattern, and some people have used it successfully on Android.
If you want to do it in C, with libpng, the idea would be the same: decode the image progressively, line by line, and do the subsampling yourself.
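A rough libpng sketch of that idea: read the image one row at a time and keep only every `factor`-th row and column (nearest-neighbour). This assumes a non-interlaced PNG already normalised to 8-bit samples; the usual setjmp error handling and png_set_* transforms are omitted for brevity:

```cpp
#include <png.h>
#include <cstdio>
#include <vector>

std::vector<unsigned char> decode_subsampled(FILE *fp, int factor,
                                             int &out_w, int &out_h)
{
    png_structp png  = png_create_read_struct(PNG_LIBPNG_VER_STRING,
                                              nullptr, nullptr, nullptr);
    png_infop   info = png_create_info_struct(png);
    png_init_io(png, fp);
    png_read_info(png, info);

    const int w        = png_get_image_width(png, info);
    const int h        = png_get_image_height(png, info);
    const int channels = png_get_channels(png, info);

    out_w = w / factor;
    out_h = h / factor;
    std::vector<unsigned char> out(out_w * out_h * channels);
    std::vector<unsigned char> row(png_get_rowbytes(png, info));

    for (int y = 0; y < h; ++y) {
        png_read_row(png, row.data(), nullptr);   // every row still has to be decoded...
        if (y % factor != 0 || y / factor >= out_h)
            continue;
        unsigned char *dst = &out[(y / factor) * out_w * channels];
        for (int x = 0; x < out_w; ++x)           // ...but only 1/factor^2 of the pixels are kept
            for (int c = 0; c < channels; ++c)
                dst[x * channels + c] = row[x * factor * channels + c];
    }
    png_destroy_read_struct(&png, &info, nullptr);
    return out;
}
```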
Bear in mind that this usage pattern would break with interlaced PNG (in that case, you'd want to decode one of the subimages), but, anyway, to store a 1080x1920 image as interlaced PNG would be a bad idea.
Android is open source; you could look at the source: the Java interface and the C++ backend - from there, it calls to the SKIA library.
This class appears to be where the sampling is done; it is called from here.