mjpeg to raw rgb24 with video4linux - c++

I'm writing a C++ webcam viewer using Video4Linux. I need RGB24 output (interleaved R8G8B8) for display. I'm able to get video input from almost all low-resolution webcams using YUYV, GREY8 or RGB24. But I also need input from high-resolution webcams, which use MJPEG compression when a high framerate is needed.
I'm able to get the MJPEG stream using V4L2_PIX_FMT_MJPEG as the pixel format, but the received framebuffer is compressed.
How can I quickly convert it to RGB24?
Can I use libjpeg for this?

The quickest solution I've found is decode_jpeg_raw from mjpegtools, which decodes the JPEG data to planar YUV420. The conversion from YUV420 to RGB24 is then done by this function:
inline int clip(int value) {
    return (value > 255) ? 255 : (value < 0) ? 0 : value;
}

static void yuv420_to_rgb24(
    /* luminance (source) */ const uint8_t* const y
    /* u chrominance (source) */, const uint8_t* u
    /* v chrominance (source) */, const uint8_t* v
    /* rgb interleaved (destination) */, uint8_t* const dst
    /* destination size in bytes */, int const size
    /* image width */, int const width) {
    const int lineSize = width * 3;
    uint8_t* r1 = dst;
    uint8_t* g1 = r1 + 1;
    uint8_t* b1 = r1 + 2;
    uint8_t* r2 = r1 + lineSize;
    uint8_t* g2 = r2 + 1;
    uint8_t* b2 = r2 + 2;
    const uint8_t* y1 = y;
    const uint8_t* y2 = y + width;
    uint8_t* const end = r1 + size;
    int c1 = 0;
    int c2 = 0;
    int e = 0;
    int d = 0;
    /* process two rows at a time: in YUV420 one U and one V sample are
       shared by a 2x2 block of pixels */
    while (r1 != end) {
        uint8_t* const lineEnd = r2;
        /* line by line */
        while (r1 != lineEnd) {
            /* first pixel */
            c1 = *y1 - 16;
            c2 = *y2 - 16;
            d = *u - 128;
            e = *v - 128;
            *r1 = clip(c1 + ((454 * e) >> 8));
            *g1 = clip(c1 - ((88 * e + 183 * d) >> 8));
            *b1 = clip(c1 + ((359 * d) >> 8));
            *r2 = clip(c2 + ((454 * e) >> 8));
            *g2 = clip(c2 - ((88 * e + 183 * d) >> 8));
            *b2 = clip(c2 + ((359 * d) >> 8));
            r1 += 3; g1 += 3; b1 += 3;
            r2 += 3; g2 += 3; b2 += 3;
            ++y1;
            ++y2;
            /* second pixel: reuses the same chroma samples */
            c1 = *y1 - 16;
            c2 = *y2 - 16;
            d = *u - 128;
            e = *v - 128;
            *r1 = clip(c1 + ((454 * e) >> 8));
            *g1 = clip(c1 - ((88 * e + 183 * d) >> 8));
            *b1 = clip(c1 + ((359 * d) >> 8));
            *r2 = clip(c2 + ((454 * e) >> 8));
            *g2 = clip(c2 - ((88 * e + 183 * d) >> 8));
            *b2 = clip(c2 + ((359 * d) >> 8));
            r1 += 3; g1 += 3; b1 += 3;
            r2 += 3; g2 += 3; b2 += 3;
            ++y1;
            ++y2;
            ++u;
            ++v;
        }
        /* skip over the second row, which has already been filled */
        r1 += lineSize; g1 += lineSize; b1 += lineSize;
        r2 += lineSize; g2 += lineSize; b2 += lineSize;
        y1 += width;
        y2 += width;
    }
}

Yes, you can use libjpeg for this, but the output of libjpeg is usually in YUV420 or YUV422.
You might instead use this code: http://mxhaard.free.fr/spca50x/Download/gspcav1-20071224.tar.gz (check the decoder source; there's a small JPEG decoder that works well and handles the color conversion directly, so the output is in RGB888).
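If you do want to stay with libjpeg, it can also deliver interleaved RGB24 itself by setting out_color_space before decompression. A minimal sketch (the function name is mine, error handling is omitted, and jpeg_mem_src requires libjpeg 8+ or libjpeg-turbo):

```cpp
#include <cstddef>
#include <vector>
#include <jpeglib.h>

// Decompress one MJPEG frame straight to interleaved RGB24 (R8G8B8).
std::vector<unsigned char> MjpegFrameToRgb24(const unsigned char* buf,
                                             unsigned long size,
                                             int& width, int& height)
{
    jpeg_decompress_struct cinfo;
    jpeg_error_mgr jerr;
    cinfo.err = jpeg_std_error(&jerr);
    jpeg_create_decompress(&cinfo);
    jpeg_mem_src(&cinfo, const_cast<unsigned char*>(buf), size);
    jpeg_read_header(&cinfo, TRUE);
    cinfo.out_color_space = JCS_RGB;   // ask libjpeg for interleaved RGB
    jpeg_start_decompress(&cinfo);
    width  = static_cast<int>(cinfo.output_width);
    height = static_cast<int>(cinfo.output_height);
    std::vector<unsigned char> rgb(static_cast<std::size_t>(width) * height * 3);
    while (cinfo.output_scanline < cinfo.output_height)
    {
        // read one scanline at a time into the right row of the RGB buffer
        unsigned char* row = &rgb[cinfo.output_scanline *
                                  static_cast<std::size_t>(width) * 3];
        jpeg_read_scanlines(&cinfo, &row, 1);
    }
    jpeg_finish_decompress(&cinfo);
    jpeg_destroy_decompress(&cinfo);
    return rgb;
}
```

This skips the intermediate YUV420 buffer entirely, at the cost of letting libjpeg do the (scalar) color conversion internally.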

Related

Converting RGBA binary image into YCbCr but the result is not as expected

I have an RGBA image in binary format (.raw) and I am trying to convert it to YCbCr using C++. However, when viewed using ffplay, the converted image is green. What could I be doing wrong? I have included code that reproduces the problem. The input image looks like this: https://drive.google.com/file/d/1oDswYmmSV0pfNe-u8Do06WWVu2v1B-rg/view?usp=sharing and a snapshot of the converted image is https://drive.google.com/file/d/1G8Rut3CXILqbmlGrFQsnushy2CLKu40w/view?usp=sharing. The input RGBA .raw image can be obtained here: https://drive.google.com/file/d/19JhMjRdibGCgaUsE6DBGAXGTRiT2bmTM/view?usp=sharing
#include <fstream>
#include <iostream>
#include <vector>
#include <array>

typedef unsigned char byte;

int main(){
    std::ifstream infile;
    const unsigned width = 1280;
    const unsigned height = 720;
    std::vector<std::array<byte, 4>> imageBuffer;
    std::vector<std::array<byte, 3>> output;
    imageBuffer.resize(width*height);
    output.resize(width*height);
    infile.open("input.raw", std::ios::binary);
    if(infile){
        infile.read(reinterpret_cast<char*>(&imageBuffer[0]), width*height*4*sizeof(char));
    }
    for (unsigned y=0; y<height; ++y){
        for(unsigned x=0; x<width; ++x){
            byte R, G, B;
            R = imageBuffer[y*width + x][0];
            G = imageBuffer[y*width + x][1];
            B = imageBuffer[y*width + x][2];
            byte Y, Cb, Cr;
            Y = 0.257*R + 0.504*G + 0.098*B + 16;
            Cb = -0.148*R - 0.291*G + 0.439*B + 128;
            Cr = 0.439*R - 0.368*G - 0.071*B + 128;
            output[y*width + x][0] = Y;
            output[y*width + x][1] = Cb;
            output[y*width + x][2] = Cr;
        }
    }
    std::ofstream os("output444.yuv", std::ios::binary);
    if(!os)
        return 1;
    os.write(reinterpret_cast<char*>(&output[0]), 1280*720*3*sizeof(char));
}
Your code is fine for YUV 4:4:4 8-bit packed.
You can view it with YUView: https://github.com/IENT/YUView/releases — select the matching format settings and it will display just fine.
However, if you are seeing it green or with otherwise wrong colours, the program reading it is expecting a different format. Most likely it expects a planar format, which means you need to write all the Y bytes first, then all the Cb bytes, then all the Cr bytes.
So for a 4x4 image it'd look like (YCbCr_4_4_4_Planar):
YYYY
YYYY
YYYY
YYYY
CbCbCbCb
CbCbCbCb
CbCbCbCb
CbCbCbCb
CrCrCrCr
CrCrCrCr
CrCrCrCr
CrCrCrCr
instead of packed, which looks like (your code above = YCbCr_4_4_4_Packed/Interleaved):
YCbCrYCbCrYCbCrYCbCr
YCbCrYCbCrYCbCrYCbCr
YCbCrYCbCrYCbCrYCbCr
YCbCrYCbCrYCbCrYCbCr
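Repacking an existing packed 4:4:4 buffer into the planar layout above is mechanical. A minimal sketch (the helper name is mine, not from the answer's code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert a packed YCbCr 4:4:4 buffer (Y,Cb,Cr,Y,Cb,Cr,...) into a planar
// layout (all Y bytes, then all Cb bytes, then all Cr bytes).
std::vector<std::uint8_t> PackedToPlanar444(const std::vector<std::uint8_t>& packed)
{
    const std::size_t pixels = packed.size() / 3;
    std::vector<std::uint8_t> planar(packed.size());
    for (std::size_t i = 0; i < pixels; ++i)
    {
        planar[i]              = packed[i * 3 + 0]; // Y plane
        planar[pixels + i]     = packed[i * 3 + 1]; // Cb plane
        planar[2 * pixels + i] = packed[i * 3 + 2]; // Cr plane
    }
    return planar;
}
```

With this, the packed writer in the question could stay as-is and the buffer could be repacked just before writing the file.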
Below I wrote some code that can handle multiple formats. It takes a RAW image and converts it to any of:
YUV_4_2_2_PLANAR
YUV_4_2_2_PACKED
YUV_4_4_4_PLANAR
YUV_4_4_4_PACKED
//
// main.cpp
// RAW-To-YUV-Conversion
//
// Created by Brandon on 2021-08-06.
//
#include <iostream>
#include <fstream>
#include <utility>
#include <memory>
#include <vector>
void RGBToYUV(std::uint8_t R, std::uint8_t G, std::uint8_t B, std::uint8_t& Y, std::uint8_t& U, std::uint8_t& V)
{
Y = 0.257 * R + 0.504 * G + 0.098 * B + 16;
U = -0.148 * R - 0.291 * G + 0.439 * B + 128;
V = 0.439 * R - 0.368 * G - 0.071 * B + 128;
}
//void RGBToYUV(std::uint8_t R, std::uint8_t G, std::uint8_t B, std::uint8_t &Y, std::uint8_t &U, std::uint8_t &V)
//{
// #define RGB2Y(r, g, b) (uint8_t)(((66 * (r) + 129 * (g) + 25 * (b) + 128) >> 8) + 16)
// #define RGB2U(r, g, b) (uint8_t)(((-38 * (r) - 74 * (g) + 112 * (b) + 128) >> 8) + 128)
// #define RGB2V(r, g, b) (uint8_t)(((112 * (r) - 94 * (g) - 18 * (b) + 128) >> 8) + 128)
//
// Y = RGB2Y((int)R, (int)G, (int)B);
// U = RGB2U((int)R, (int)G, (int)B);
// V = RGB2V((int)R, (int)G, (int)B);
//}
enum Format
{
YUV_4_2_2_PLANAR,
YUV_4_2_2_PACKED,
YUV_4_4_4_PLANAR,
YUV_4_4_4_PACKED,
};
class RawImage
{
private:
std::unique_ptr<std::uint8_t[]> pixels; // array form so delete[] is used
std::uint32_t width, height;
std::uint16_t bpp;
public:
RawImage(const char* path, std::uint32_t width, std::uint32_t height);
~RawImage() {}
void SaveYUV(const char* path, Format format);
};
RawImage::RawImage(const char* path, std::uint32_t width, std::uint32_t height) : pixels(nullptr), width(width), height(height), bpp(32)
{
std::ifstream file(path, std::ios::in | std::ios::binary);
if (file)
{
std::size_t size = width * height * 4;
file.seekg(0, std::ios::beg);
pixels.reset(new std::uint8_t[size]);
file.read(reinterpret_cast<char*>(pixels.get()), size);
}
}
void RawImage::SaveYUV(const char* path, Format format)
{
std::ofstream file(path, std::ios::out | std::ios::binary);
if (file)
{
if (format == Format::YUV_4_2_2_PLANAR)
{
std::unique_ptr<std::uint8_t[]> y_plane{new std::uint8_t[width * height]};
std::unique_ptr<std::uint8_t[]> u_plane{new std::uint8_t[(width * height) >> 1]};
std::unique_ptr<std::uint8_t[]> v_plane{new std::uint8_t[(width * height) >> 1]};
std::uint8_t* in = pixels.get();
std::uint8_t* y_plane_ptr = y_plane.get();
std::uint8_t* u_plane_ptr = u_plane.get();
std::uint8_t* v_plane_ptr = v_plane.get();
for (std::uint32_t i = 0; i < height; ++i)
{
for (std::uint32_t j = 0; j < width; j += 2)
{
std::uint32_t offset = 4;
std::size_t in_pos = i * (width * offset) + offset * j;
std::uint8_t Y1 = 0;
std::uint8_t U1 = 0;
std::uint8_t V1 = 0;
std::uint8_t Y2 = 0;
std::uint8_t U2 = 0;
std::uint8_t V2 = 0;
RGBToYUV(in[in_pos + 0], in[in_pos + 1], in[in_pos + 2], Y1, U1, V1);
RGBToYUV(in[in_pos + 4], in[in_pos + 5], in[in_pos + 6], Y2, U2, V2);
std::uint8_t U3 = (U1 + U2 + 1) >> 1;
std::uint8_t V3 = (V1 + V2 + 1) >> 1;
*y_plane_ptr++ = Y1;
*y_plane_ptr++ = Y2;
*u_plane_ptr++ = U3;
*v_plane_ptr++ = V3;
}
}
file.write(reinterpret_cast<char*>(y_plane.get()), width * height);
file.write(reinterpret_cast<char*>(u_plane.get()), (width * height) >> 1);
file.write(reinterpret_cast<char*>(v_plane.get()), (width * height) >> 1);
}
else if (format == Format::YUV_4_2_2_PACKED)
{
std::size_t size = width * height * 2;
std::unique_ptr<std::uint8_t[]> buffer{new std::uint8_t[size]};
std::uint8_t* in = pixels.get();
std::uint8_t* out = buffer.get();
for (std::uint32_t i = 0; i < height; ++i)
{
for (std::uint32_t j = 0; j < width; j += 2)
{
std::uint32_t offset = 4;
std::size_t in_pos = i * (width * offset) + offset * j;
std::uint8_t Y1 = 0;
std::uint8_t U1 = 0;
std::uint8_t V1 = 0;
std::uint8_t Y2 = 0;
std::uint8_t U2 = 0;
std::uint8_t V2 = 0;
RGBToYUV(in[in_pos + 0], in[in_pos + 1], in[in_pos + 2], Y1, U1, V1);
RGBToYUV(in[in_pos + 4], in[in_pos + 5], in[in_pos + 6], Y2, U2, V2);
std::uint8_t U3 = (U1 + U2 + 1) >> 1;
std::uint8_t V3 = (V1 + V2 + 1) >> 1;
std::size_t out_pos = i * (width * 2) + 2 * j;
out[out_pos + 0] = Y1;
out[out_pos + 1] = U3;
out[out_pos + 2] = Y2;
out[out_pos + 3] = V3;
}
}
file.write(reinterpret_cast<char*>(buffer.get()), size);
}
else if (format == Format::YUV_4_4_4_PLANAR)
{
std::size_t size = width * height * 3;
std::unique_ptr<std::uint8_t[]> buffer{new std::uint8_t[size]};
std::uint8_t* in = pixels.get();
std::uint8_t* out = buffer.get();
for (std::uint32_t i = 0; i < height; ++i)
{
for (std::uint32_t j = 0; j < width; ++j)
{
std::uint32_t offset = 4;
std::size_t in_pos = i * (width * offset) + offset * j;
std::uint8_t Y = 0;
std::uint8_t U = 0;
std::uint8_t V = 0;
RGBToYUV(in[in_pos + 0], in[in_pos + 1], in[in_pos + 2], Y, U, V);
std::size_t y_pos = i * width + j;
std::size_t u_pos = y_pos + (width * height);
std::size_t v_pos = y_pos + (width * height * 2);
out[y_pos] = Y;
out[u_pos] = U;
out[v_pos] = V;
}
}
file.write(reinterpret_cast<char*>(buffer.get()), size);
}
else if (format == Format::YUV_4_4_4_PACKED)
{
std::size_t size = width * height * 3;
std::unique_ptr<std::uint8_t[]> buffer{new std::uint8_t[size]};
std::uint8_t* in = pixels.get();
std::uint8_t* out = buffer.get();
for (std::uint32_t i = 0; i < height; ++i)
{
for (std::uint32_t j = 0; j < width; ++j)
{
std::uint32_t offset = 4;
std::size_t in_pos = i * (width * offset) + offset * j;
std::uint8_t Y = 0;
std::uint8_t U = 0;
std::uint8_t V = 0;
RGBToYUV(in[in_pos + 0], in[in_pos + 1], in[in_pos + 2], Y, U, V);
std::size_t out_pos = i * (width * 3) + 3 * j;
out[out_pos + 0] = Y;
out[out_pos + 1] = U;
out[out_pos + 2] = V;
}
}
file.write(reinterpret_cast<char*>(buffer.get()), size);
}
}
}
int main(int argc, const char * argv[]) {
RawImage img{"/Users/brandon/Downloads/input.raw", 1280, 720};
img.SaveYUV("/Users/brandon/Downloads/output.yuv", Format::YUV_4_4_4_PACKED);
return 0;
}
You are overwriting the same byte here:
output[y*width + x][0] = Y;
output[y*width + x][0] = Cb;
output[y*width + x][0] = Cr;
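Spelled out as a standalone sketch (the helper is hypothetical), the fix is one channel index per component:

```cpp
#include <array>
#include <cstdint>

// Corrected store: Y, Cb and Cr each go to their own slot of the 3-byte pixel,
// instead of all three writes landing on slot 0.
std::array<std::uint8_t, 3> StoreYCbCr(std::uint8_t Y, std::uint8_t Cb, std::uint8_t Cr)
{
    std::array<std::uint8_t, 3> out;
    out[0] = Y;
    out[1] = Cb;
    out[2] = Cr;
    return out;
}
```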

SSE mean filter in c++ and OpenCV

I would like to modify the code for an OpenCV mean filter to use Intel intrinsics. I'm an SSE newbie and I really don't know where to start from. I checked a lot of resources on the web, but I didn't have a lot of success.
This is the program:
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"

using namespace cv;
using namespace std;

int main()
{
    int A[3][3] = { { 1, 1, 1 }, { 1, 1, 1 }, { 1, 1, 1 } };
    int c = 0;
    int d = 0;
    Mat var1 = imread("images.jpg", 1);
    Mat var2(var1.rows, var1.cols, CV_8UC3, Scalar(0, 0, 0));
    for (int i = 0; i < var1.rows; i++)
    {
        var2.at<Vec3b>(i, 0) = var1.at<Vec3b>(i, 0);
        var2.at<Vec3b>(i, var1.cols - 1) = var1.at<Vec3b>(i, var1.cols - 1);
    }
    for (int i = 0; i < var1.cols; i++)
    {
        var2.at<Vec3b>(0, i) = var1.at<Vec3b>(0, i);
        var2.at<Vec3b>(var1.rows - 1, i) = var1.at<Vec3b>(var1.rows - 1, i);
    }
    for (int i = 0; i < var1.rows; i++) {
        for (int j = 0; j < var1.cols; j++)
        {
            c = 0;
            for (int m = i; m < var1.rows; m++, c++)
            {
                if (c < 3)
                {
                    d = 0;
                    for (int n = j; n < var1.cols; n++, d++)
                    {
                        if (d < 3)
                        {
                            if ((i + 1) < var1.rows && (j + 1) < var1.cols)
                            {
                                var2.at<Vec3b>(i + 1, j + 1)[0] += var1.at<Vec3b>(m, n)[0] * A[m - i][n - j] / 9;
                                var2.at<Vec3b>(i + 1, j + 1)[1] += var1.at<Vec3b>(m, n)[1] * A[m - i][n - j] / 9;
                                var2.at<Vec3b>(i + 1, j + 1)[2] += var1.at<Vec3b>(m, n)[2] * A[m - i][n - j] / 9;
                            }
                        }
                    }
                }
            }
        }
    }
    imshow("window1", var1);
    imshow("window2", var2);
    waitKey(0);
    return(0);
}
The part that I find difficult is understanding how to convert the innermost 2 loops, where the mean value is computed. Any help will be greatly appreciated.
Just for fun, I thought it might be interesting to start with a naive implementation of a 3x3 mean filter and then optimise this incrementally, ending up with a SIMD (SSE) implementation, measuring the throughput improvement at each stage.
1 - Mean_3_3_ref - reference implementation
This is just a simple scalar implementation which we'll use as a baseline for throughput and for validating further implementations:
void Mean_3_3_ref(const Mat &image_in, Mat &image_out)
{
    for (int y = 1; y < image_in.rows - 1; ++y)
    {
        for (int x = 1; x < image_in.cols - 1; ++x)
        {
            for (int c = 0; c < 3; ++c)
            {
                image_out.at<Vec3b>(y, x)[c] = (image_in.at<Vec3b>(y - 1, x - 1)[c] +
                                                image_in.at<Vec3b>(y - 1, x    )[c] +
                                                image_in.at<Vec3b>(y - 1, x + 1)[c] +
                                                image_in.at<Vec3b>(y    , x - 1)[c] +
                                                image_in.at<Vec3b>(y    , x    )[c] +
                                                image_in.at<Vec3b>(y    , x + 1)[c] +
                                                image_in.at<Vec3b>(y + 1, x - 1)[c] +
                                                image_in.at<Vec3b>(y + 1, x    )[c] +
                                                image_in.at<Vec3b>(y + 1, x + 1)[c] + 4) / 9;
            }
        }
    }
}
2 - Mean_3_3_scalar - somewhat optimised scalar implementation
Exploit the redundancy in summing successive columns - we save the last two column sums so that we only need to calculate one new column sum (per channel) on each iteration:
void Mean_3_3_scalar(const Mat &image_in, Mat &image_out)
{
    for (int y = 1; y < image_in.rows - 1; ++y)
    {
        int r_1, g_1, b_1;
        int r0, g0, b0;
        int r1, g1, b1;
        r_1 = g_1 = b_1 = 0;
        r0 = g0 = b0 = 0;
        for (int yy = y - 1; yy <= y + 1; ++yy)
        {
            r_1 += image_in.at<Vec3b>(yy, 0)[0];
            g_1 += image_in.at<Vec3b>(yy, 0)[1];
            b_1 += image_in.at<Vec3b>(yy, 0)[2];
            r0 += image_in.at<Vec3b>(yy, 1)[0];
            g0 += image_in.at<Vec3b>(yy, 1)[1];
            b0 += image_in.at<Vec3b>(yy, 1)[2];
        }
        for (int x = 1; x < image_in.cols - 1; ++x)
        {
            r1 = g1 = b1 = 0;
            for (int yy = y - 1; yy <= y + 1; ++yy)
            {
                r1 += image_in.at<Vec3b>(yy, x + 1)[0];
                g1 += image_in.at<Vec3b>(yy, x + 1)[1];
                b1 += image_in.at<Vec3b>(yy, x + 1)[2];
            }
            image_out.at<Vec3b>(y, x)[0] = (r_1 + r0 + r1 + 4) / 9;
            image_out.at<Vec3b>(y, x)[1] = (g_1 + g0 + g1 + 4) / 9;
            image_out.at<Vec3b>(y, x)[2] = (b_1 + b0 + b1 + 4) / 9;
            r_1 = r0;
            g_1 = g0;
            b_1 = b0;
            r0 = r1;
            g0 = g1;
            b0 = b1;
        }
    }
}
3 - Mean_3_3_scalar_opt - further optimised scalar implementation
As per Mean_3_3_scalar, but also remove OpenCV overheads by caching pointers to each row that we are working on:
void Mean_3_3_scalar_opt(const Mat &image_in, Mat &image_out)
{
    for (int y = 1; y < image_in.rows - 1; ++y)
    {
        const uint8_t * const input_1 = image_in.ptr(y - 1);
        const uint8_t * const input0 = image_in.ptr(y);
        const uint8_t * const input1 = image_in.ptr(y + 1);
        uint8_t * const output = image_out.ptr(y);
        int r_1 = input_1[0] + input0[0] + input1[0];
        int g_1 = input_1[1] + input0[1] + input1[1];
        int b_1 = input_1[2] + input0[2] + input1[2];
        int r0 = input_1[3] + input0[3] + input1[3];
        int g0 = input_1[4] + input0[4] + input1[4];
        int b0 = input_1[5] + input0[5] + input1[5];
        for (int x = 1; x < image_in.cols - 1; ++x)
        {
            int r1 = input_1[x * 3 + 3] + input0[x * 3 + 3] + input1[x * 3 + 3];
            int g1 = input_1[x * 3 + 4] + input0[x * 3 + 4] + input1[x * 3 + 4];
            int b1 = input_1[x * 3 + 5] + input0[x * 3 + 5] + input1[x * 3 + 5];
            output[x * 3    ] = (r_1 + r0 + r1 + 4) / 9;
            output[x * 3 + 1] = (g_1 + g0 + g1 + 4) / 9;
            output[x * 3 + 2] = (b_1 + b0 + b1 + 4) / 9;
            r_1 = r0;
            g_1 = g0;
            b_1 = b0;
            r0 = r1;
            g0 = g1;
            b0 = b1;
        }
    }
}
4 - Mean_3_3_blur - leverage OpenCV's blur function
OpenCV has a function called blur, which is based on the function boxFilter, which is just another name for a mean filter. Since OpenCV code has been quite heavily optimised over the years (using SIMD in many cases), let's see if this makes a big improvement over our scalar code:
void Mean_3_3_blur(const Mat &image_in, Mat &image_out)
{
    blur(image_in, image_out, Size(3, 3));
}
5 - Mean_3_3_SSE - SSE implementation
This a reasonably efficient SIMD implementation. It uses the same techniques as the scalar code above in order to eliminate redundancy in processing successive pixels:
#include <tmmintrin.h> // Note: requires SSSE3 (aka MNI)
inline void Load2(const ssize_t offset, const uint8_t* const src, __m128i& vh, __m128i& vl)
{
const __m128i v = _mm_loadu_si128((__m128i *)(src + offset));
vh = _mm_unpacklo_epi8(v, _mm_setzero_si128());
vl = _mm_unpackhi_epi8(v, _mm_setzero_si128());
}
inline void Store2(const ssize_t offset, uint8_t* const dest, const __m128i vh, const __m128i vl)
{
__m128i v = _mm_packus_epi16(vh, vl);
_mm_storeu_si128((__m128i *)(dest + offset), v);
}
template <int SHIFT> __m128i ShiftL(const __m128i v0, const __m128i v1) { return _mm_alignr_epi8(v1, v0, SHIFT * sizeof(short)); }
template <int SHIFT> __m128i ShiftR(const __m128i v0, const __m128i v1) { return _mm_alignr_epi8(v1, v0, 16 - SHIFT * sizeof(short)); }
template <int CHANNELS> void Mean_3_3_SSE_Impl(const Mat &image_in, Mat &image_out)
{
const int nx = image_in.cols;
const int ny = image_in.rows;
const int kx = 3 / 2; // x, y borders
const int ky = 3 / 2;
const int kScale = 3 * 3; // scale factor = total number of pixels in sum
const __m128i vkScale = _mm_set1_epi16((32768 + kScale / 2) / kScale);
const int nx0 = ((nx + kx) * CHANNELS + 15) & ~15; // round up total width to multiple of 16
int x, y;
for (y = ky; y < ny - ky; ++y)
{
const uint8_t * const input_1 = image_in.ptr(y - 1);
const uint8_t * const input0 = image_in.ptr(y);
const uint8_t * const input1 = image_in.ptr(y + 1);
uint8_t * const output = image_out.ptr(y);
__m128i vsuml_1, vsumh0, vsuml0;
__m128i vh, vl;
vsuml_1 = _mm_set1_epi16(0);
Load2(0, input_1, vsumh0, vsuml0);
Load2(0, input0, vh, vl);
vsumh0 = _mm_add_epi16(vsumh0, vh);
vsuml0 = _mm_add_epi16(vsuml0, vl);
Load2(0, input1, vh, vl);
vsumh0 = _mm_add_epi16(vsumh0, vh);
vsuml0 = _mm_add_epi16(vsuml0, vl);
for (x = 0; x < nx0; x += 16)
{
__m128i vsumh1, vsuml1, vsumh, vsuml;
Load2((x + 16), input_1, vsumh1, vsuml1);
Load2((x + 16), input0, vh, vl);
vsumh1 = _mm_add_epi16(vsumh1, vh);
vsuml1 = _mm_add_epi16(vsuml1, vl);
Load2((x + 16), input1, vh, vl);
vsumh1 = _mm_add_epi16(vsumh1, vh);
vsuml1 = _mm_add_epi16(vsuml1, vl);
vsumh = _mm_add_epi16(vsumh0, ShiftR<CHANNELS>(vsuml_1, vsumh0));
vsuml = _mm_add_epi16(vsuml0, ShiftR<CHANNELS>(vsumh0, vsuml0));
vsumh = _mm_add_epi16(vsumh, ShiftL<CHANNELS>(vsumh0, vsuml0));
vsuml = _mm_add_epi16(vsuml, ShiftL<CHANNELS>(vsuml0, vsumh1));
// round mean
vsumh = _mm_mulhrs_epi16(vsumh, vkScale);
vsuml = _mm_mulhrs_epi16(vsuml, vkScale);
Store2(x, output, vsumh, vsuml);
vsuml_1 = vsuml0;
vsumh0 = vsumh1;
vsuml0 = vsuml1;
}
}
}
void Mean_3_3_SSE(const Mat &image_in, Mat &image_out)
{
const int channels = image_in.channels();
switch (channels)
{
case 1:
Mean_3_3_SSE_Impl<1>(image_in, image_out);
break;
case 3:
Mean_3_3_SSE_Impl<3>(image_in, image_out);
break;
default:
throw("Unsupported format.");
break;
}
}
Results
I benchmarked all of the above implementations on an 8th gen Core i9 (MacBook Pro 16,1) at 2.4 GHz, with an image size of 2337 rows x 3180 cols. The compiler was Apple clang version 12.0.5 (clang-1205.0.22.9) and the only optimisation switch was -O3. OpenCV version was 4.5.0 (via Homebrew). (Note: I verified that for Mean_3_3_blur the cv::blur function was dispatched to an AVX2 implementation.) The results:
Mean_3_3_ref 62153 µs
Mean_3_3_scalar 41144 µs = 1.51062x
Mean_3_3_scalar_opt 26238 µs = 2.36882x
Mean_3_3_blur 20121 µs = 3.08896x
Mean_3_3_SSE 4838 µs = 12.84680x
Notes
I have ignored the border pixels in all implementations - if required these can either be filled with pixels from the original image or using some other form of edge pixel processing.
The code is not "industrial strength" - it was only written for benchmarking purposes.
There are a few further possible optimisations, e.g. use wider SIMD (AVX2, AVX512), exploit the redundancy between successive rows, etc - these are left as an exercise for the reader.
The SSE implementation is fastest, but this comes at the cost of increased complexity, decreased maintainability and reduced portability.
The OpenCV blur function gives the second best performance, and should probably be the preferred solution if it meets throughput requirements - it's the simplest solution, and simple is good.

opencv mean function for gray image implementation via neon

uint64_t sum = 0;
const uint8_t* src = (const uint8_t*)gray.data;
int i, k, w = gray.cols, h = gray.rows, pitch = gray.step.p[0], w32 = w >> 5 << 5;
for(i = 0; i < h; ++ i)
{
    const uint8_t* line = src;
    for(k = 0; k < w32; k += 32)
    {
        // two 16-byte loads, then pairwise-widening adds down to one u64x2
        uint8x16_t a16 = vld1q_u8(line); line += 16;
        uint8x16_t b16 = vld1q_u8(line); line += 16;
        uint16x8_t a8 = vpaddlq_u8(a16);
        uint16x8_t b8 = vpaddlq_u8(b16);
        uint32x4_t a4 = vpaddlq_u16(a8);
        uint32x4_t b4 = vpaddlq_u16(b8);
        uint64x2_t a2 = vpaddlq_u32(a4);
        a2 = vpadalq_u32(a2, b4);
        sum += vgetq_lane_u64(a2, 0) + vgetq_lane_u64(a2, 1);
    }
    for( ; k < w; ++ k) // scalar tail for widths that are not a multiple of 32
        sum += src[k];
    src += pitch;
}
printf("%f\t%f\n", (double)sum / (double)(w * h), cv::mean(gray)[0]);
When I tested with a binary image, the results were equal.
When the image is grayscale, the results are not equal, but I can't find the problem in my code.
Edit: the code above is now right. The benchmark over 1000 loops on an 8-bit gray image is:
cv::mean 0.000328
myMean 0.000091
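When validating this kind of SIMD reduction, a plain scalar reference makes mismatches easy to localise, since it handles the row pitch explicitly. A minimal sketch (the helper is mine, not part of the question):

```cpp
#include <cstdint>
#include <vector>

// Scalar reference: mean of an 8-bit grayscale buffer whose row pitch may be
// larger than the width (padding bytes at the end of each row are skipped).
double GrayMeanRef(const std::uint8_t* data, int w, int h, int pitch)
{
    std::uint64_t sum = 0;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            sum += data[y * pitch + x];
    return static_cast<double>(sum) / (static_cast<double>(w) * h);
}
```

Comparing the NEON sum against this over random images (with and without padding) quickly shows whether a discrepancy comes from the pitch handling or from the intrinsics chain.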

What is wrong in my RC6 implementation?

Can anyone see where I made a mistake here? I know that the algorithm will properly decrypt data that it encrypted. However, most of the encrypted output does not match the test vectors from the RC6 paper.
// hexlify(string) turns a string into its hex representation: hexlify("AB") -> "4142"
// unhexlify(string) turns a string into its ASCII representation: unhexlify("4142") -> "AB"
// uint128_t is my own version of uint128, and Im pretty sure that the math is correct
// little_end(string, base) flips a string by bytes to get the little endian version of the string
// ROL/ROR(int, rotate x bits, bitsize of input int) does bitwise rotation
class RC6{
private:
unsigned int w, r, b, lgw;
std::vector <uint32_t> S;
uint128_t mod;
std::string mode;
void keygen(std::string KEY){
uint64_t p, q;
rc_pq(w, p, q);
KEY = hexlify(KEY);
unsigned int u = (unsigned int) ceil(w / 8.);
unsigned int c = (unsigned int) ceil(float(b) / u);
while ((KEY.size() >> 1) % u != 0)
KEY += zero;
std::vector <uint32_t> L;
for(unsigned int x = 0; x < c; x++)
L.push_back(toint(little_end(KEY.substr(2 * u * x, 2 * u), 16), 16));
S.push_back(p);
for(unsigned int i = 0; i < 2 * r + 3; i++)
S.push_back((S[i] + q) % mod);
uint32_t A = 0, B = 0, i = 0, j = 0;
uint32_t v = 3 * std::max(c, 2 * r + 4);
for(unsigned int s = 1; s < v + 1; s++){
A = S[i] = ROL((S[i] + A + B) % mod, 3, w);
B = L[j] = ROL((L[j] + A + B) % mod, (A + B) % w, w);
i = (i + 1) % (2 * r + 4);
j = (j + 1) % c;
}
}
public:
RC6(std::string KEY, std::string MODE, unsigned int W = 32, unsigned int R = 20, unsigned int B = 16){
w = W;
r = R;
b = B;
mod = uint128_t(1) << w;
lgw = (unsigned int) log2(w);
mode = MODE;
keygen(KEY);
}
std::string run(std::string DATA){
DATA = hexlify(DATA);
uint32_t A = toint(little_end(DATA.substr(0, 8), 16), 16), B = toint(little_end(DATA.substr(8, 8), 16), 16), C = toint(little_end(DATA.substr(16, 8), 16), 16), D = toint(little_end(DATA.substr(24, 8), 16), 16);
if (mode == "e"){
B += S[0];
D += S[1];
for(unsigned int i = 1; i < r + 1; i++){
uint64_t t = ROL((uint64_t) ((B * (2 * B + 1)) % mod), lgw, w);
uint64_t u = ROL((uint64_t) ((D * (2 * D + 1)) % mod), lgw, w);
A = ROL(A ^ t, u % w, w) + S[2 * i];
C = ROL(C ^ u, t % w, w) + S[2 * i + 1];
uint64_t temp = A; A = B % mod; B = C % mod; C = D % mod; D = temp % mod;
}
A += S[2 * r + 2];
C += S[2 * r + 3];
}
else{
C -= S[2 * r + 3];
A -= S[2 * r + 2];
for(int i = r; i > 0; i--){
uint64_t temp = D; D = C % mod; C = B % mod; B = A % mod; A = temp % mod;
uint64_t u = ROL((uint64_t) ((D * (2 * D + 1)) % mod), lgw, w);
uint64_t t = ROL((uint64_t) ((B * (2 * B + 1)) % mod), lgw, w);
C = ROR((C - S[2 * i + 1]) % mod, t % w, w) ^ u;
A = ROR((A - S[2 * i]) % mod, u % w, w) ^ t;
}
D -= S[1];
B -= S[0];
}
w >>= 2;
return unhexlify(little_end(makehex(A % mod, w)) + little_end(makehex(B % mod, w)) + little_end(makehex(C % mod, w)) + little_end(makehex(D % mod, w)));
}
};
Of these test vectors, only the first two come out correct; the rest do not:
data = "00000000000000000000000000000000";
key = "00000000000000000000000000000000";
ciphertext = "8fc3a53656b1f778c129df4e9848a41e";
data = "02132435465768798a9bacbdcedfe0f1";
key = "0123456789abcdef0112233445566778";
ciphertext = "524e192f4715c6231f51f6367ea43f18";
data = "00000000000000000000000000000000";
key = "000000000000000000000000000000000000000000000000";
ciphertext = "6cd61bcb190b30384e8a3f168690ae82";
data = "02132435465768798a9bacbdcedfe0f1";
key = "0123456789abcdef0112233445566778899aabbccddeeff0";
ciphertext = "688329d019e505041e52e92af95291d4";
data = "00000000000000000000000000000000";
key = "0000000000000000000000000000000000000000000000000000000000000000";
ciphertext = "8f5fbd0510d15fa893fa3fda6e857ec2";
data = "02132435465768798a9bacbdcedfe0f1";
key = "0123456789abcdef0112233445566778899aabbccddeeff01032547698badcfe";
ciphertext = "c8241816f0d7e48920ad16a1674e5d48";
Did I mess up a uint somewhere? A wrong little-endian conversion?
I think I figured it out. Can anyone corroborate? I think that because I set b = 16 by default, I'm causing the errors (b is the key length in bytes, so the longer keys would be truncated). My hard drive is dead or I would have tested this already.

How to optimize this code?

The profiler says that 50% of the total time is spent inside this function. How would you optimize it?
It converts a BMP colour scheme to YUV. Thanks!
Update: the platform is ARMv6 (writing for iPhone).
#define Y_FROM_RGB(_r_,_g_,_b_) ( ( 66 * _b_ + 129 * _g_ + 25 * _r_ + 128) >> 8) + 16
#define V_FROM_RGB(_r_,_g_,_b_) ( ( 112 * _b_ - 94 * _g_ - 18 * _r_ + 128) >> 10) + 128
#define U_FROM_RGB(_r_,_g_,_b_) ( ( -38 * _b_ - 74 * _g_ + 112 * _r_ + 128) >> 10) + 128
/*!
* \brief
* Converts 24 bit image to YCrCb image channels
*
* \param source
* Source 24bit image pointer
*
* \param source_width
* Source image width
*
* \param dest_Y
* destination image Y component pointer
*
* \param dest_scan_size_Y
* destination image Y component line size
*
* \param dest_U
* destination image U component pointer
*
* \param dest_scan_size_U
* destination image U component line size
*
* \param dest_V
* destination image V component pointer
*
* \param dest_scan_size_V
* destination image V component line size
*
* \param dest_width
* Destination image width = source_width
*
* \param dest_height
* Destination image height = source image height
*
* Convert 24 bit image (source) with width (source_width)
* to YCrCb image channels (dest_Y, dest_U, dest_V) with size (dest_width)x(dest_height), and line size
* (dest_scan_size_Y, dest_scan_size_U, dest_scan_size_V) (in bytes)
*
*/
void ImageConvert_24_YUV420P(unsigned char * source, int source_width,
unsigned char * dest_Y, int dest_scan_size_Y,
unsigned char * dest_U, int dest_scan_size_U,
unsigned char * dest_V, int dest_scan_size_V,
int dest_width, int dest_height)
{
int source_scan_size = source_width*3;
int half_width = dest_width/2;
//Y loop
for (int y = 0; y < dest_height/2; y ++)
{
//Start of line
unsigned char * source_scan = source;
unsigned char * source_scan_next = source+source_scan_size;
unsigned char * dest_scan_Y = dest_Y;
unsigned char * dest_scan_U = dest_U;
unsigned char * dest_scan_V = dest_V;
//Do all pixels
for (int x = 0; x < half_width; x++)
{
int R = source_scan[0];
int G = source_scan[1];
int B = source_scan[2];
//Y
int Y = Y_FROM_RGB(B, G, R);
*dest_scan_Y = Y;
source_scan += 3;
dest_scan_Y += 1;
int R1 = source_scan[0];
int G1 = source_scan[1];
int B1 = source_scan[2];
//Y
Y = Y_FROM_RGB(B1, G1, R1);
R += (R1 + source_scan_next[0] + source_scan_next[3]);
G += (G1 + source_scan_next[1] + source_scan_next[4]);
B += (B1 + source_scan_next[2] + source_scan_next[5]);
//YCrCb
*dest_scan_Y = Y;
*dest_scan_V = V_FROM_RGB(B, G, R);
*dest_scan_U = U_FROM_RGB(B, G, R);
source_scan += 3;
dest_scan_Y += 1;
dest_scan_U += 1;
dest_scan_V += 1;
source_scan_next += 6;
};
//scroll to next line
source += source_scan_size;
dest_Y += dest_scan_size_Y;
dest_U += dest_scan_size_U;
dest_V += dest_scan_size_V;
//Start of line
source_scan = source;
dest_scan_Y = dest_Y;
//Do all pixels
for (int x = 0; x < half_width; x ++)
{
int R = source_scan[0];
int G = source_scan[1];
int B = source_scan[2];
//Y
int Y = Y_FROM_RGB(B, G, R);
*dest_scan_Y = Y;
source_scan += 3;
dest_scan_Y += 1;
R = source_scan[0];
G = source_scan[1];
B = source_scan[2];
//Y
Y = Y_FROM_RGB(B, G, R);
*dest_scan_Y = Y;
source_scan += 3;
dest_scan_Y += 1;
};
source += source_scan_size;
dest_Y += dest_scan_size_Y;
};
};
Unless I am missing something, the following code seems to be repeated in both loops, so why not go through this loop once? This may require some changes to your algorithm, but it would improve performance.
for (int x = 0; x < half_width; x ++)
{
    int R = source_scan[0];
    int G = source_scan[1];
    int B = source_scan[2];
    //Y
    int Y = Y_FROM_RGB(B, G, R);
    *dest_scan_Y = Y;
    source_scan += 3;
    dest_scan_Y += 1;
    R = source_scan[0];
    G = source_scan[1];
    B = source_scan[2];
But before doing anything, move the two inner loops into separate functions, then run your profiler and see whether you spend more time in one function than the other.
You have three loops in this function, and you don't know which of them is actually where you are spending your time. So determine that before doing any optimization; otherwise you may find that you are fixing the wrong section.
I don't know what platform you are using, but you might want to look at SIMD.
The ARM Cortex-A8 has NEON technology that supports SIMD. You should be able to find more information on the ARM website.
Presuming that the memory they point to does not overlap, you should declare your source, dest_Y, dest_U and dest_V pointers with the restrict qualifier, to tell the compiler this and allow it to optimise better.
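As a sketch of what that looks like (using __restrict, the common compiler spelling in C++, since restrict is standard only in C99; the function here is a toy stand-in, not the question's converter):

```cpp
#include <cstdint>

// With __restrict the compiler may assume a, b and out never alias, which
// lets it keep values in registers and vectorise the loop more aggressively.
void AddBytes(const std::uint8_t* __restrict a,
              const std::uint8_t* __restrict b,
              std::uint8_t* __restrict out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = static_cast<std::uint8_t>(a[i] + b[i]);
}
```

The same qualifier would go on source, dest_Y, dest_U and dest_V in ImageConvert_24_YUV420P, provided the caller guarantees the buffers really do not overlap.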