GraphicsMagick TTF Font Performance - C++

I am using GraphicsMagick in a C++ library to create rasterized output, which mainly consists of text.
I am doing something like this:
void gfx_writer::add_text(Magick::Image& img) const
{
using namespace Magick;
const unsigned x = // just a position;
const unsigned y_title = // just a position;
const unsigned y_heading = // just a position;
const unsigned y_value = // just a position;
img.strokeColor("transparent");
img.fillColor("black");
img.font(font_title_);
img.fontPointsize(font_size_title_);
img.draw(DrawableText{static_cast<double>(x), static_cast<double>(y_title), "a text"});
img.font(font_heading_);
img.fontPointsize(font_size_heading_);
img.draw(DrawableText{static_cast<double>(x), static_cast<double>(y_heading), "another text"});
img.font(font_value_);
img.fontPointsize(font_size_value_);
img.draw(DrawableText{static_cast<double>(x), static_cast<double>(y_value), "third text"});
}
Here font_title_, font_heading_, and font_value_ are paths to the TTF files.
This is done more than once, and I experience rather bad performance. When I look at what happens using Sysinternals Process Monitor, I see that the TTF files are read over and over again. So my questions are:
Are my observations correct that the TTF files are read each time img.font(...) is called?
Is there a way to cache the font using GraphicsMagick, or to provide something other than just the path to the TTF file?
Is there anything else I am missing?

Note: This answer uses ImageMagick's Magick++ library, and may have minor portability issues with GraphicsMagick, but the underlying solution is the same.
Are my observations correct that the TTF files are read each time img.font(...) is called?
Yes, the TTF font is reloaded each time. One option is to install the fonts on the system and use the font-family constructor:
DrawableFont ( const std::string &family_,
StyleType style_,
const unsigned long weight_,
StretchType stretch_ );
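For example, with the fonts installed, the path-based img.font(...) calls can be replaced by the family-based constructor. Here is a self-contained sketch (not the answer's exact code); "Open Sans", the weight 400, and the style/stretch values are placeholders for whatever face is actually installed:
#include <Magick++.h>
#include <list>
void draw_with_installed_font(Magick::Image& img)
{
    using namespace Magick;
    std::list<Drawable> ctx;
    ctx.push_back(DrawableFont("Open Sans", NormalStyle, 400, NormalStretch)); // family, style, weight, stretch
    ctx.push_back(DrawablePointSize(32));
    ctx.push_back(DrawableFillColor("black"));
    ctx.push_back(DrawableText(10.0, 10.0, "a text"));
    img.draw(ctx);
}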
Most systems have some sort of font-caching mechanism that allows quicker access, although the difference is not really noticeable on modern hardware.
Is there anything else I am missing?
Try building a drawing context, and call Magick::Image::draw only once. Remember that the Drawable... calls are only wrapping MVG statements, and creating a std::list<Drawable> allows you to build complex vector drawings. The TTF is loaded only when the draw method consumes the drawing commands, so it's key to prepare all the drawing commands ahead of time.
Let's start by rewriting the code you provided (and I'm taking a degree of liberty here).
#include <Magick++.h>
const char * font_title_ = "fonts/OpenSans-Regular.ttf";
const char * font_heading_ = "fonts/LiberationMono-Regular.ttf";
const char * font_value_ = "fonts/sansation.ttf";
double font_size_title_ = 32;
double font_size_heading_ = 24;
double font_size_value_ = 16;
void gfx_writer_add_text(Magick::Image& img)
{
using namespace Magick;
double x = 10.0;
double y_title = 10;
double y_heading = 20.0;
double y_value = 30.0;
img.strokeColor("transparent");
img.fillColor("black");
img.font(font_title_);
img.fontPointsize(font_size_title_);
img.draw(DrawableText{x, y_title, "a text"});
img.font(font_heading_);
img.fontPointsize(font_size_heading_);
img.draw(DrawableText{x, y_heading, "another text"});
img.font(font_value_);
img.fontPointsize(font_size_value_);
img.draw(DrawableText{x, y_value, "third text"});
}
int main()
{
Magick::Image img("wizard:");
gfx_writer_add_text(img);
gfx_writer_add_text(img);
gfx_writer_add_text(img);
img.write("output.png");
}
I can compile and benchmark the run time. I get the following times:
$ time ./original.o
real 0m5.061s
user 0m0.094s
sys 0m0.029s
Now refactor the code to use a drawing context, and call Magick::Image::draw only once:
#include <Magick++.h>
#include <list>
const char * font_title_ = "fonts/OpenSans-Regular.ttf";
const char * font_heading_ = "fonts/LiberationMono-Regular.ttf";
const char * font_value_ = "fonts/sansation.ttf";
double font_size_title_ = 32;
double font_size_heading_ = 24;
double font_size_value_ = 16;
void gfx_writer_add_text(Magick::Image& img)
{
using namespace Magick;
double x = 10.0;
double y_title = 10;
double y_heading = 20.0;
double y_value = 30.0;
std::list<Drawable> ctx;
ctx.push_back(DrawableStrokeColor("transparent"));
ctx.push_back(DrawableFillColor("black"));
/* TITLE */
ctx.push_back(DrawablePushGraphicContext());
ctx.push_back(DrawableFont(font_title_));
ctx.push_back(DrawablePointSize(font_size_title_));
ctx.push_back(DrawableText{x, y_title, "a text"});
ctx.push_back(DrawablePopGraphicContext());
/* HEADING */
ctx.push_back(DrawablePushGraphicContext());
ctx.push_back(DrawableFont(font_heading_));
ctx.push_back(DrawablePointSize(font_size_heading_));
ctx.push_back(DrawableText{x, y_heading, "another text"});
ctx.push_back(DrawablePopGraphicContext());
/* VALUE */
ctx.push_back(DrawablePushGraphicContext());
ctx.push_back(DrawableFont(font_value_));
ctx.push_back(DrawablePointSize(font_size_value_));
ctx.push_back(DrawableText{x, y_value, "third text"});
ctx.push_back(DrawablePopGraphicContext());
img.draw(ctx);
}
int main()
{
Magick::Image img("wizard:");
gfx_writer_add_text(img);
gfx_writer_add_text(img);
gfx_writer_add_text(img);
img.write("output2.png");
}
And the benchmark times are dramatically better:
$ time ./with_context.o
real 0m0.106s
user 0m0.090s
sys 0m0.012s
This is done more than once and I experience rather bad performance.
It's worth taking a step back and asking: "How can I refactor my solution to draw only at the last possible moment?"

Related

Application for stitching bmp images together. Help needed

I'm supposed to create some code to stitch together N bmp images found in a folder. At the moment I just want to add the images together, side by side; I don't care yet about common regions in them (I'm referring to how panoramic images are made).
I have tried to use some examples online for the different functions I need, examples which I've only partially understood. I'm currently stuck because I can't really figure out what's wrong.
The basis of the bmp.h file is this page:
https://solarianprogrammer.com/2018/11/19/cpp-reading-writing-bmp-images/
I'm attaching my code and a screenshot of the exception VS throws.
main:
#include "bmp.h"
#include <fstream>
#include <iostream>
#include <filesystem>
namespace fs = std::filesystem;
int main() {
int totalImages = 0;
int width = 0;
int height;
int count = 0;
std::string path = "imagini";
// Here I count the total number of images in the directory.
// I need this to know the width of the composed image that I have to produce.
for (const auto & entry : fs::directory_iterator(path))
totalImages++;
// Here I thought about going through the directory and finding out the width of the images inside it.
// I haven't managed to think of a better way to do this (which is probably why it doesn't work, I guess).
// Ideally, I would take one image from the directory and multiply its width
// by the total number of images in said directory, thus getting the width of the resulting image I need.
for (auto& p : fs::directory_iterator(path))
{
std::string s = p.path().string();
const char* imageName = s.c_str();
BMP image(imageName);
width = width + image.bmp_info_header.width;
height = image.bmp_info_header.height;
}
BMP finalImage(width, height);
// Finally, I was going to pass over the directory again and, for each image inside it, call
// the create_data function that I wrote in bmp.h.
for (auto& p : fs::directory_iterator(path))
{
count++;
std::string s = p.path().string();
const char* imageName = s.c_str();
BMP image(imageName);
// I use count to point out which image the iterator is currently at.
finalImage.create_data(count, image, totalImages);
}
// Here I would write the finalImage to a bmp image.
finalImage.write("textura.bmp");
}
bmp.h (I have only added the part I wrote, the rest of the code is found at the link I've provided above):
// This is where I try to copy the pixel RGBA values from the image passed as a parameter (from its data vector) to my
// resulting image (its data vector).
// The math should be right, I've gone through it with pen & paper, but right now I can't test it because the code doesn't work for other reasons.
void create_data(int count, BMP image, int totalImages)
{
int w = image.bmp_info_header.width * 4;
int h = image.bmp_info_header.height;
int q = 0;
int channels = image.bmp_info_header.bit_count / 8;
int startJ = w * channels * (count - 1);
int finalI = w * channels * totalImages* (h - 1);
int incrementI = w * channels * totalImages;
for(int i = 0; i <= finalI; i+incrementI)
for (int j = i + startJ; j < i + startJ + w * channels; j+4)
{
data[j] =image.data[q];
data[j+1]=image.data[q+1];
data[j+2]=image.data[q+2];
data[j+3]=image.data[q+3];
q = q + 4;
}
}
Error I get: https://imgur.com/7fq9BH4
This is the first time I have posted a question; I've only looked up answers here before. If I don't provide enough info about my problem, or something I've done is not OK, I apologize.
Also, English is my second language, so I hope I got my points across clearly.
EDIT: Since I forgot to mention, I would like to do this code without using external libraries like OpenCV or ImageMagick.
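For reference, the per-row copy described above might be sketched roughly like this. This is purely an illustration of the intended math, not a fix for the code above: it assumes all source BMPs share the same width and height, use 32-bit RGBA pixels with no row padding, and it reuses the data and bmp_info_header members from the linked bmp.h:
#include "bmp.h"
#include <algorithm>
#include <cstdint>
#include <vector>

// Copy source image number `count` (1-based) into its horizontal strip of the
// destination pixel buffer `dest`, which is `totalImages` source widths wide.
void copy_into_strip(std::vector<uint8_t>& dest, const BMP& image,
                     int count, int totalImages)
{
    const int channels = image.bmp_info_header.bit_count / 8;  // 4 for RGBA
    const int w = image.bmp_info_header.width;
    const int h = image.bmp_info_header.height;
    const int destRowBytes = w * channels * totalImages;       // bytes per destination row
    const int xOffsetBytes = w * channels * (count - 1);       // where this strip starts within a row
    for (int y = 0; y < h; ++y)
    {
        const int srcOffset = y * w * channels;
        const int destOffset = y * destRowBytes + xOffsetBytes;
        std::copy(image.data.begin() + srcOffset,
                  image.data.begin() + srcOffset + w * channels,
                  dest.begin() + destOffset);
    }
}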

C++, Linear interpolation between unsigned chars

Unexpectedly for me, I have run into a strange issue.
Here is an example of the logical implementation of trivial linear (Lagrange) interpolation:
unsigned char mix(unsigned char x0, unsigned char x1, float position){
// LOGICALLY it should be something like this (a real implementation
// would use explicit casts)...
return x0 + (x1 - x0) * position;
}
Arguments x0, x1 are always in range 0 - 255.
Argument position is always in range 0.0f - 1.0f.
I have really tried a huge number of implementations (with different casts, etc.), but it doesn't work in my case! It returns incorrect results (it looks like variable overflow or something similar). After looking for a solution on the internet for a whole week, I decided to ask. Maybe someone has faced similar issues.
I'm using the MSVC 2017 compiler (most parameters are at their defaults except the language level).
OS: Windows 10 x64, little endian.
What am I doing wrong, and what is the possible source of the issue?
UPDATED:
It looks like this issue is deeper than I expected (thanks for your responses).
Here is the link to tiny github project which demonstrates my issue:
https://github.com/elRadiance/altitudeMapVisualiser
The output bmp file should contain a smooth altitude map. Instead, it contains garbage. If I use just x0 or x1 as the result of the interpolation function (without interpolating), it works. With the interpolation it doesn't (it produces garbage).
Desired result (as here, but in interpolated colors, smooth)
Actual result (updated, best result achieved)
Main class to run it:
#include "Visualiser.h"
int main() {
unsigned int width = 512;
unsigned int height = 512;
float* altitudes = new float[width * height];
float c;
for (int w = 0; w < width; w++) {
c = (2.0f * w / width) - 1.0f;
for (int h = 0; h < height; h++) {
altitudes[w*height + h] = c;
}
}
Visualiser::visualiseAltitudeMap("gggggggggg.bmp", width, height, altitudes);
delete[] altitudes;
}
Thank you in advance!
SOLVED: Thanks to @wololo. The mistake in my project was not in the calculations.
I should open the file with the "binary" option:
file.open("test.bin", std::ios_base::out | std::ios_base::trunc | std::ios_base::binary);
Without it, the data may at some point contain a byte with the value 10.
In a Windows environment, a text-mode stream treats that byte as a line feed and translates it to a carriage return/line feed pair, corrupting the binary output.
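As a minimal illustration of the difference (file names are placeholders), writing a buffer that contains the byte 10 through a default text-mode stream can alter it on Windows, while a binary-mode stream writes it verbatim:
#include <fstream>

int main()
{
    const char bytes[] = { 72, 10, 73 };      // the middle byte is 10 (LF)

    std::ofstream text_out("text_mode.bin");  // default (text) mode: on Windows, 10 becomes 13,10
    text_out.write(bytes, sizeof bytes);

    std::ofstream bin_out("binary_mode.bin",
                          std::ios_base::out | std::ios_base::trunc | std::ios_base::binary);
    bin_out.write(bytes, sizeof bytes);       // binary mode: bytes are written unchanged
    return 0;
}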

C++ fast way to save image from array of values

Right now, I am using CImg.
I am unable to use OpenCV due to this issue.
My CImg code looks like this:
cimg_library::CImg<float> img(512,512);
cimg_forXYC(img,x,y,c) { img(x,y,c) = (array[x][y]); } // array contains float values between 0 and 1
img.save(save.c_str()); //taking a lot of time
Using clocks, I was able to determine that the first step, the for loop, takes 0-0.01 seconds. However, the second step, saving the image, takes 0.06 seconds, which is far too long given the number of images I have.
I am saving as bitmaps.
Is there any faster way to accomplish the same thing (create an image from an array of values and save it) in C++?
Here is a small function that will save your image in PGM format, which most things can read and which is dead simple. It requires your compiler to support C++11, which most do. It's also hard-coded to 512x512 images.
#include <fstream>
#include <string>
#include <cmath>
#include <cstdint>
void save_image(const ::std::string &name, float img_vals[][512])
{
using ::std::string;
using ::std::ios;
using ::std::ofstream;
typedef unsigned char pixval_t;
auto float_to_pixval = [](float img_val) -> pixval_t {
int tmpval = static_cast<int>(::std::floor(256 * img_val));
if (tmpval < 0) {
return 0u;
} else if (tmpval > 255) {
return 255u;
} else {
return tmpval & 0xffu;
}
};
auto as_pgm = [](const string &name) -> string {
if (! ((name.length() >= 4)
&& (name.substr(name.length() - 4, 4) == ".pgm")))
{
return name + ".pgm";
} else {
return name;
}
};
ofstream out(as_pgm(name), ios::binary | ios::out | ios::trunc);
out << "P5\n512 512\n255\n";
for (int x = 0; x < 512; ++x) {
for (int y = 0; y < 512; ++y) {
const pixval_t pixval = float_to_pixval(img_vals[x][y]);
const char outpv = static_cast<const char>(pixval);
out.write(&outpv, 1);
}
}
}
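A hypothetical call site might look like this (the gradient data is just an example; the array is static so the roughly 1 MB buffer does not land on the stack):
int main()
{
    static float vals[512][512];              // 512*512 floats, roughly 1 MB
    for (int x = 0; x < 512; ++x) {
        for (int y = 0; y < 512; ++y) {
            vals[x][y] = (x + y) / 1024.0f;   // simple diagonal gradient in [0, 1)
        }
    }
    save_image("gradient", vals);             // writes "gradient.pgm"
    return 0;
}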
In a similar vein to @Omnifarious's answer, there is an extremely simple format (also based on NetPBM concepts) for float data such as yours. It is called PFM and is documented here.
The benefit is that both CImg and ImageMagick are able to read and write the format without any additional libraries, and without you needing to write any code! An additional benefit is that you retain the full tonal range of your floats, rather than just 256 steps. On the downside, you do need the full 4 bytes per pixel rather than 1 byte.
So, your code would become:
CImg<float> img(512,512);
cimg_forXYC(img,x,y,c) { img(x,y,c) = (array[x][y]); }
img.save_pfm("filename.pfm");
I benchmarked this by creating 10,000 images and saving them to disk with the following code:
#include <iostream>
#include <cstdio>   // for sprintf
#include <cstdlib>
#define cimg_display 0 // No need for X11 stuff
#include "CImg.h"
using namespace cimg_library;
using namespace std;
#define W 512
#define H 512
#define N 10000
int main() {
// Create and initialise float image with radial gradient
cimg_library::CImg<float> img(W,H);
cimg_forXY(img,x,y) {img(x,y) = hypot((float)(W/2-x),(float)(H/2-y)); }
char filename[128];
for(int i=0;i<N;i++){
sprintf(filename,"f-%06d.pfm",i);
img.save_pfm(filename);
}
}
It runs in 21.8 seconds, meaning 2.1 ms per image (0.002s).
As I mentioned earlier, ImageMagick is also able to handle PFM format, so you can then use GNU Parallel and ImageMagick mogrify to convert those images to JPEG:
parallel -X mogrify -format jpg -auto-level ::: *pfm
For the original 10,000 images, that takes 22 seconds, or 2.2 ms/image.

Halide with GPU (OpenGL) as Target - benchmarking and using HalideRuntimeOpenGL.h

I am new to Halide. I have been playing around with the tutorials to get a feel for the language. Now, I am writing a small demo app to run from command line on OSX.
My goal is to perform a pixel-by-pixel operation on an image, schedule it on the GPU and measure the performance. I have tried a couple things which I want to share here and have a few questions about the next steps.
First approach
I scheduled the algorithm on the GPU with the Target being OpenGL, but because I could not access the GPU memory to write it to a file, I copied the output back to the CPU inside the Halide routine by creating a Func cpu_out, similar to the glsl sample app in the Halide repo.
pixel_operation_cpu_out.cpp
#include "Halide.h"
#include <stdio.h>
using namespace Halide;
const int _number_of_channels = 4;
int main(int argc, char** argv)
{
ImageParam input8(UInt(8), 3);
input8
.set_stride(0, _number_of_channels) // stride in dimension 0 (x) is the number of channels
.set_stride(2, 1); // stride in dimension 2 (c) is one
Var x("x"), y("y"), c("c");
// algorithm
Func input;
input(x, y, c) = cast<float>(input8(clamp(x, input8.left(), input8.right()),
clamp(y, input8.top(), input8.bottom()),
clamp(c, 0, _number_of_channels))) / 255.0f;
Func pixel_operation;
// calculate the corresponding value for input(x, y, c) after doing a
// pixel-wise operation on each each pixel. This gives us pixel_operation(x, y, c).
// This operation is not location dependent, eg: brighten
Func out;
out(x, y, c) = cast<uint8_t>(pixel_operation(x, y, c) * 255.0f + 0.5f);
out.output_buffer()
.set_stride(0, _number_of_channels)
.set_stride(2, 1);
input8.set_bounds(2, 0, _number_of_channels); // Dimension 2 (c) starts at 0 and has extent _number_of_channels.
out.output_buffer().set_bounds(2, 0, _number_of_channels);
// schedule
out.compute_root();
out.reorder(c, x, y)
.bound(c, 0, _number_of_channels)
.unroll(c);
// Schedule for GLSL
out.glsl(x, y, c);
Target target = get_target_from_environment();
target.set_feature(Target::OpenGL);
// create a cpu_out Func to copy over the data in Func out from GPU to CPU
std::vector<Argument> args = {input8};
Func cpu_out;
cpu_out(x, y, c) = out(x, y, c);
cpu_out.output_buffer()
.set_stride(0, _number_of_channels)
.set_stride(2, 1);
cpu_out.output_buffer().set_bounds(2, 0, _number_of_channels);
cpu_out.compile_to_file("pixel_operation_cpu_out", args, target);
return 0;
}
Since I compile this AOT, I make a function call in my main() for it. main() resides in another file.
main_file.cpp
Note: the Image class used here is the same as the one in this Halide sample app
int main()
{
char *encoded_jpeg_input_buffer = read_from_jpeg_file("input_image.jpg");
unsigned char *pixelsRGBA = decompress_jpeg(encoded_jpeg_input_buffer);
Image input(width, height, channels, sizeof(uint8_t), Image::Interleaved);
Image output(width, height, channels, sizeof(uint8_t), Image::Interleaved);
input.buf.host = &pixelsRGBA[0];
unsigned char *outputPixelsRGBA = (unsigned char *)malloc(sizeof(unsigned char) * width * height * channels);
output.buf.host = &outputPixelsRGBA[0];
double best = benchmark(100, 10, [&]() {
pixel_operation_cpu_out(&input.buf, &output.buf);
});
char* encoded_jpeg_output_buffer = compress_jpeg(output.buf.host);
write_to_jpeg_file("output_image.jpg", encoded_jpeg_output_buffer);
}
This works just fine and gives me the output I expect. From what I understand, cpu_out makes the values in out available in CPU memory, which is why I am able to access these values via output.buf.host in main_file.cpp.
Second approach:
The second thing I tried was to not copy from device to host in the Halide schedule (i.e. not create Func cpu_out), but instead to use the copy_to_host function in main_file.cpp.
pixel_operation_gpu_out.cpp
#include "Halide.h"
#include <stdio.h>
using namespace Halide;
const int _number_of_channels = 4;
int main(int argc, char** argv)
{
ImageParam input8(UInt(8), 3);
input8
.set_stride(0, _number_of_channels) // stride in dimension 0 (x) is the number of channels
.set_stride(2, 1); // stride in dimension 2 (c) is one
Var x("x"), y("y"), c("c");
// algorithm
Func input;
input(x, y, c) = cast<float>(input8(clamp(x, input8.left(), input8.right()),
clamp(y, input8.top(), input8.bottom()),
clamp(c, 0, _number_of_channels))) / 255.0f;
Func pixel_operation;
// calculate the corresponding value for input(x, y, c) after doing a
// pixel-wise operation on each each pixel. This gives us pixel_operation(x, y, c).
// This operation is not location dependent, eg: brighten
Func out;
out(x, y, c) = cast<uint8_t>(pixel_operation(x, y, c) * 255.0f + 0.5f);
out.output_buffer()
.set_stride(0, _number_of_channels)
.set_stride(2, 1);
input8.set_bounds(2, 0, _number_of_channels); // Dimension 2 (c) starts at 0 and has extent _number_of_channels.
out.output_buffer().set_bounds(2, 0, _number_of_channels);
// schedule
out.compute_root();
out.reorder(c, x, y)
.bound(c, 0, _number_of_channels)
.unroll(c);
// Schedule for GLSL
out.glsl(x, y, c);
Target target = get_target_from_environment();
target.set_feature(Target::OpenGL);
std::vector<Argument> args = {input8};
out.compile_to_file("pixel_operation_gpu_out", args, target);
return 0;
}
main_file.cpp
#include "pixel_operation_gpu_out.h"
#include "runtime/HalideRuntime.h"
int main()
{
char *encoded_jpeg_input_buffer = read_from_jpeg_file("input_image.jpg");
unsigned char *pixelsRGBA = decompress_jpeg(encoded_jpeg_input_buffer);
Image input(width, height, channels, sizeof(uint8_t), Image::Interleaved);
Image output(width, height, channels, sizeof(uint8_t), Image::Interleaved);
input.buf.host = &pixelsRGBA[0];
unsigned char *outputPixelsRGBA = (unsigned char *)malloc(sizeof(unsigned char) * width * height * channels);
output.buf.host = &outputPixelsRGBA[0];
double best = benchmark(100, 10, [&]() {
pixel_operation_gpu_out(&input.buf, &output.buf);
});
int status = halide_copy_to_host(NULL, &output.buf);
char* encoded_jpeg_output_buffer = compress_jpeg(output.buf.host);
write_to_jpeg_file("output_image.jpg", encoded_jpeg_output_buffer);
return 0;
}
So, now, what I think is happening is that pixel_operation_gpu_out is keeping output.buf on the GPU and when I do copy_to_host, that's when I get the memory copied over to the CPU. This program gives me the expected output as well.
Questions:
The second approach is much slower than the first approach. The slow part is not in the benchmarked part, though. For example, for the first approach, I get 17 ms as the benchmarked time for a 4K image. For the same image, in the second approach, I get a benchmarked time of 22 µs, and the time taken for copy_to_host is 10 s. I'm not sure if this behavior is expected, since both approach 1 and 2 are essentially doing the same thing.
The next thing I tried was to use HalideRuntimeOpenGL.h and link textures to the input and output buffers, to be able to draw directly to an OpenGL context from main_file.cpp instead of saving to a JPEG file. However, I could find no examples showing how to use the functions in HalideRuntimeOpenGL.h, and whatever I tried on my own always gave me run-time errors that I could not figure out how to solve. If anyone has any resources they can point me to, that would be great.
Also, any feedback on the code I have above is welcome too. I know it works and does what I want, but it could be the completely wrong way of doing it and I wouldn't know any better.
Most likely the reason it takes 10 s to copy the memory back is that the GPU API has queued all the kernel invocations and only waits for them to finish when halide_copy_to_host is called. You can call halide_device_sync inside the benchmark timing, after running all the compute calls, to get the compute time inside the loop without the copy-back time.
I cannot tell from the code how many times the kernel is being run. (My guess is 100, but it may be that those arguments to benchmark set up some sort of parameterization where it tries to run as many times as needed to reach statistical significance. If so, that is a problem, because the queuing call is really fast while the compute is of course asynchronous. In that case, you can do things like queue ten calls, then call halide_device_sync, and play with the number "10" to get a real picture of how long it takes.)
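A sketch of that suggestion, applied to the timing block from the second approach's main_file.cpp (it assumes the same input/output buffers and the AOT-compiled pixel_operation_gpu_out; halide_device_sync takes a user context, which may be NULL here, and the buffer to wait on):
double best = benchmark(100, 10, [&]() {
    pixel_operation_gpu_out(&input.buf, &output.buf);
    // Wait for the queued GPU work to finish so the timed region includes the compute,
    // while still excluding the device-to-host copy.
    halide_device_sync(NULL, &output.buf);
});
// Copy the finished result back once, outside the timed region.
int status = halide_copy_to_host(NULL, &output.buf);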

How to automatically run a C++ program with different values for specific parameters

I have some code in C++, and I need to run many versions of it separately, changing the values of two parameters (alpha and cost).
The parameter ranges are as follows:
for (int cost = 0; cost <= 100; cost+=5){
for(float alpha = 0.5; alpha<=2.5; alpha+=0.1){
I don't know how to make this happen. I have searched a lot, but most of the solutions were either too complicated or not applicable for me. Thanks in advance for your help.
The structure of my code is fairly simple: I have two functions other than the main function. I am using Visual Studio 2012 on Windows 7.
P.S. The computations are not run by me; I will pass the exe file of my program to a cluster computer. Overall there should be 400 different parameter sets, and I need 5 repetitions of each.
Here is what I finally found, and it worked for me:
int main(int argc, char const *argv[]){
for (int cost = 0; cost <= 100; cost+=5){
for(float alpha = 0.5; alpha<=2.5; alpha+=0.1){
string s1 = to_string(cost);
char const *pchar1 = s1.c_str();
argv[1] = pchar1;
string s2 = to_string(alpha);
char const *pchar2 = s2.c_str();
argv[2] = pchar2;
. . .
I'm not sure I understand exactly what you want, but the following may help:
You have to call my_f(int argc, const char*argv[]) several times:
int main(int argc, char *argv[]){
for (int cost = 0; cost <= 100; cost += 5) {
for (float alpha = 0.5f; alpha <= 2.5f; alpha += 0.1f) {
const int myargc = 2;
const std::string scost = std::to_string(cost);
const std::string salpha = std::to_string(alpha);
const char* myargv[] = { scost.c_str(), salpha.c_str() };
my_f(myargc, myargv);
}
}
return 0;
}
You have to launch: my_a.exe cost alpha
I suggest using a shell script to launch the application several times with different parameters.
Otherwise, in C++, you have to use fork/exec...
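If a shell script is not convenient, a small C++ driver can also do the launching portably with std::system instead of fork/exec (a sketch; my_a.exe is a placeholder for the actual executable name, and it is assumed to read cost and alpha from the command line):
#include <cstdlib>
#include <string>

int main()
{
    for (int cost = 0; cost <= 100; cost += 5) {
        for (float alpha = 0.5f; alpha <= 2.5f; alpha += 0.1f) {
            for (int rep = 0; rep < 5; ++rep) {   // 5 repetitions of each parameter set
                const std::string cmd = "my_a.exe " + std::to_string(cost)
                                      + " " + std::to_string(alpha);
                std::system(cmd.c_str());         // blocks until this run finishes
            }
        }
    }
    return 0;
}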