parallelize a video transformation program with tbb - c++

So, I am given a program in C++ and I have to parallelize it using TBB (make it faster). As I looked into the code I thought that using a pipeline would make sense. The problem is that I have little experience and whatever I found on the web confused me even more. Here is the main part of the code:
uint64_t cbRaw=uint64_t(w)*h*bits/8;
std::vector<uint64_t> raw(cbRaw/8);
std::vector<uint32_t> pixels(w*h);
while(1){
    if(!read_blob(STDIN_FILENO, cbRaw, &raw[0]))
        break; // No more images
    unpack_blob(w, h, bits, &raw[0], &pixels[0]);
    process(levels, w, h, bits, pixels);
    //invert(levels, w, h, bits, pixels);
    pack_blob(w, h, bits, &pixels[0], &raw[0]);
    write_blob(STDOUT_FILENO, cbRaw, &raw[0]);
}
It actually reads a video file, unpacks it, applies the transformation, packs it, and then writes it to the output. It seems pretty straightforward, so if you have any ideas or resources that could be helpful, please share.
Thanks in advance,
D. Christ.

Indeed you can use tbb::parallel_pipeline to process multiple video "blobs" in parallel.
The basic scheme is a 3-stage pipeline: an input filter reads a blob, a middle filter processes it, and the last one writes the processed blob to the file. The input and output filters should be serial_in_order, and the middle filter can be parallel. Unpacking and packing could be done either in the middle stage (I would start with that, to minimize the amount of work in the serial stages) or in the input and output stages (but that could be slower).
You will also need to ensure that the data storage (raw and pixels in your case) is not shared between concurrently processed blobs. Perhaps the easiest way is to have per-blob storage which is passed through the pipeline. Unlike in the serial program, it will be impossible to use automatic variables for storage that needs to be passed between pipeline stages; thus, you will need to allocate your storage with new in the input filter, pass it by reference (or via a pointer) through the pipeline, and then delete it after all processing is done in the output filter. This is surely necessary for the raw storage. For pixels, however, you can keep using an automatic variable if all operations that need it - i.e. unpacking, processing, and packing the result - are done within the body of the middle filter. Of course, the declaration of the variable should move there as well.
Let me sketch a modification to your serial code to make it more ready for applying parallel_pipeline. Note that I changed raw to be a dynamically allocated array rather than a std::vector; the code you showed did not really use it as a vector anyway. Be aware that it's just a sketch, and it might not work as is.
uint64_t cbRaw=uint64_t(w)*h*bits/8;
uint64_t * raw; // now a pointer to a dynamically allocated array
while(1){
    { // The input stage
        raw = new uint64_t[cbRaw/8];
        if(!read_blob(STDIN_FILENO, cbRaw, raw)) {
            delete[] raw;
            break; // No more images
        }
    }
    { // The second stage
        std::vector<uint32_t> pixels(w*h);
        unpack_blob(w, h, bits, raw, &pixels[0]);
        process(levels, w, h, bits, pixels);
        //invert(levels, w, h, bits, pixels);
        pack_blob(w, h, bits, &pixels[0], raw);
    }
    { // The output stage
        write_blob(STDOUT_FILENO, cbRaw, raw);
        delete[] raw;
    }
}
There is a tutorial on the pipeline in the TBB documentation. Try matching your code to the example there; it should be pretty easy to do. You may also ask for help at the TBB forum.

Related

How to copy every N-th byte(s) of a C array

I am writing a bit of code in C++ where I want to play a .wav file and perform an FFT (with fftw) on it as it comes (and eventually display that FFT on screen with ncurses). This is mainly just a "for giggles/to see if I can" project, so I have no restrictions on what I can or can't use aside from wanting to keep the result fairly lightweight and cross-platform (I'm doing this on Linux for the moment). I'm also trying to do this "right" and not just hack it together.
I'm using SDL2_audio to achieve the playback, which is working fine. The callback is called at some interval requesting N bytes (seems to be desiredSamples*nChannels). My idea is that at the same time I'm copying the memory from my input buffer to SDL I might as well also copy it in to fftw3's input array to run an FFT on it. Then I can just set ncurses to refresh at whatever rate I'd like separate from the audio callback frequency and it'll just pull the most recent data from the output array.
The catch is that the input file is formatted with the channels packed together, i.e. "(LR) (LR) (LR) ...". So while SDL expects this, I need a way to get just one channel to send to FFTW.
The audio callback format from SDL looks like so:
void myAudioCallback(void* userdata, Uint8* stream, int len) {
    SDL_memset(stream, 0, len);  // note: sizeof(stream) here would only zero pointer-size bytes
    SDL_memcpy(stream, audio_pos, len);
    audio_pos += len;
}
where userdata is (currently) unused, stream is the array that SDL wants filled, and len is the length of stream (i.e. the number of bytes SDL is looking for).
As far as I know there's no way to get memcpy to copy only every other sample (read: copy N bytes, skip M, copy N, etc.). My current best idea is a brute-force for loop à la...
// pseudocode
for (int i = 0; i < len/2; i++) {
    fftw_in[i] = audio_pos + 2*i*sizeof(sample);
}
or even more brute force by just reading the file a second time and only taking every other byte or something.
Is there another way to go about accomplishing this, or is one of these my best option? It feels kind of kludgey to go from a nice one-line memcpy for sending the data to SDL to some sort of weird loop for sending it to fftw.
The OP's solution can be simplified (for copying bytes):
// pseudocode
const char* s = audio_pos;
for (int d = 0; s < audio_pos + len; d++, s += 2*sizeof(sample)) {
    fftw_in[d] = *s;
}
If I knew what fftw_in is, I would memcpy blocks of sizeof(*fftw_in).
Please check the assembly generated by S.M.'s solution.
If the code is not vectorized, I would use intrinsics (depending on your hardware support) like _mm_mask_blend_epi8

How to read a large amount of images (more than a million) and process them efficiently

I have a program which performs some computations on images to produce other images. The system works with a small number of images as input, and I would like to know how to make it work for large amounts of input data, like a million or more images. My main concern is how to store the input images and how to store the output (produced) images.
I have a function compute(const std::vector<cv::Mat>&) which makes computations on 124 images at a time (due to GPU memory limitations). Because the algorithm is iterative, it takes a different number of iterations for each image to produce the output image.
Right now, if I provide more than 124 images, the function computes on the first 124 images, and when an image finishes its computation it is swapped with another one. I would like the algorithm to be usable with larger inputs, like a million or more images. The computation function returns one output image for each input image and is implemented like:
std::vector<cv::Mat> compute(std::vector<cv::Mat>& image_vec) {
    std::vector<cv::Mat> output_images(image_vec.size());
    std::vector<cv::Mat> tmp_images(124);
    size_t processed_images = 0;
    while (processed_images < image_vec.size()) {
        // make computations and update the output_images
        // for the images that are currently being processed;
        // remove the images that finished from the tmp_images
        // and update the processed_images variable;
        // import new images to tmp_images unless there
        // are no more in the input vector
    }
    return output_images;
}
I am using boost::filesystem to read images from a folder (and OpenCV to read and store each image) at the beginning of the program:
std::vector<cv::Mat> read_images_from_dir(std::string dir_name) {
    std::vector<cv::Mat> image_vec;
    boost::filesystem::path p(dir_name);
    std::vector<boost::filesystem::path> tmp_vec;
    std::copy(boost::filesystem::directory_iterator(p),
              boost::filesystem::directory_iterator(),
              back_inserter(tmp_vec));
    std::vector<boost::filesystem::path>::const_iterator it = tmp_vec.begin();
    for (; it != tmp_vec.end(); ++it) {
        if (is_regular_file(*it)) {
            //std::cout << it->string() << std::endl;
            image_vec.push_back(read_image(it->string()));
        }
    }
    return image_vec;
}
And then the main program looks like this:
int main(int argc, char* argv[]) {
    // suppose for this example that argv[1] contains a correct
    // path to a folder which contains images
    std::vector<cv::Mat> input_images = read_images_from_dir(argv[1]);
    std::vector<cv::Mat> output_images = compute(input_images);
    // save the output_images
}
Here you can find the program in an online editor, if you wish.
Any suggestion that clarifies the question is welcome.
Edit: Some of the answers and comments pointed out useful design decisions that I have to make, so that you will be able to answer the question. I would like to mention that:
The images are (or will be) already stored on disk before the program starts.
The processing will be done "offline", with no new data coming in, and will run once every few hours (or days), because the parameters of the program change after each run.
I can tolerate not having the fastest possible implementation at first, because I want to make things work and then consider optimizations.
The code is computation-heavy, so I/O has not taken much of the time so far.
I expect this code to run on a single machine, but I think it will be better NOT to use multithreading in a first version, so that the code is more portable and so that I can integrate it into another program that does not use multithreading; I do not want to add more dependencies.
One implementation I thought about is reading batches of data (say 5K images) and loading new data after computing their output. But I do not know if there is something far better without too much additional complexity. Of course, any answer is welcome.

Accessing buffer using C++-AMP

Could somebody please help me understand exactly the step that is not working here?
I am trying to use C++-AMP to do parallel-for loops, however despite having no trouble or errors going through my process, I can't get my final data.
I want to pull out my data by means of mapping it
m_pDeviceContext->Map(pBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &MappedResource);
{
blah
}
But I've worked on this for days on end without even a single inch of progress.
Here is everything I do with C++-AMP:
Constructor: I initialise my variables because I have to
: m_AcceleratorView(concurrency::direct3d::create_accelerator_view(reinterpret_cast<IUnknown *>(_pDevice)))
, m_myArray(_uiNumElement, m_AcceleratorView)
I copy my initial data into the C++-AMP array
concurrency::copy(Data.begin(), m_myArray);
I do stuff to the data
concurrency::parallel_for_each(...) restrict(amp)
{
blah
}
All of this seems fine, I run into no errors.
However the next step I want to do is pull the data from the buffer, which doesn't seem to work:
ID3D11Buffer* pBuffer = reinterpret_cast<ID3D11Buffer *>(concurrency::direct3d::get_buffer(m_myArray));
When I map this data (deviceContext->Map) the data inside is 0x00000000
What step am I forgetting that will allow me to read this data? Even when I try to set the CPU read/write access type I get an error, and I didn't even see any of my references do it that way either:
m_Accelerator.set_default_cpu_access_type(concurrency::access_type::access_type_read_write);
This creates an error to say "accelerator does not support zero copy"
Can anyone please help me and tell me why I can't read my buffer, and how to fix it?
The following code should work for this. You should also check that the DX device you are using and the C++ AMP accelerator are associated with the same hardware.
HRESULT hr = S_OK;
array<int, 1> arr(1024);
CComPtr<ID3D11Buffer> buffer;
IUnknown* unkBuf = get_buffer(arr);
hr = unkBuf->QueryInterface(__uuidof(ID3D11Buffer), reinterpret_cast<LPVOID*>(&buffer));
This question has an answer that shows you how to do the opposite.
Reading Buffer Data using C++ AMP

Data structure for quick access to glyph textures via char

I am attempting to create an edit box that allows users to input text. I've been working on this for some time now and have tossed around different ideas. Ultimately, the one I think that would offer the best performance is to load all the characters from the .ttf (I'm using SDL to manage events, windows, text, and images for openGL) onto their own surface, and then render those surfaces onto textures one time. Then each frame, I can just bind an appropriate texture in the appropriate location.
However, now I'm thinking how to access these glyphs. My limited bkg would say something like this:
struct CharTexture {
    char glyph;
    GLuint TextureID;
    int Width;
    int Height;
    CharTexture* Next;
};
//Code
CharTexture* FindGlyph(char Foo) {
    CharTexture* Poo = _FirstOne;
    while( Poo != NULL ) {
        if( Foo == Poo->glyph ) {
            return Poo;
        }
        Poo = Poo->Next;
    }
    return NULL;
}
I know that will work. However, it seems very wasteful to iterate the entire list each time. My scripting experience has taught me some Lua, and Lua has tables that allow unordered indices of all sorts of types. How could I mimic that in C++, such that instead of this iteration I could do something like:
CharTexture* FindGlyph(char Foo) {
    return PooPointers[Foo]; //somehow use the character as a key to get pointer to glyph without iteration
}
I was thinking I could try converting to the numerical value, but I don't know how to convert a char to a UTF-8 value or whether I could use those as keys. I could convert to ASCII, but would that handle all the characters I would want to be able to type? I am trying to get this application to run on Mac and Windows and am not sure about the machine specifics. I've read about the differences between the formats (ASCII vs. Unicode vs. UTF-8 vs. UTF-16, etc.)... I understand it has to do with bit width and endianness, but I understand relatively little about the interface differences between platforms and the implications of said endianness for my code.
Thank you
What you probably want is
std::map<char, CharTexture*> PooPointers;
Using the array access operator will still perform a search in the map behind the scenes, but it is an optimized (logarithmic-time) lookup.
What g-makulik has said is probably right: the map may be what you're after. To expand on the reply, maps are automatically sorted based on the key (char in this case), so a lookup by character is extremely quick using
CharTexture* pCharTexture = PooPointers[c];
This gives you a sparse data structure where you don't have to predefine a texture for each character.
Note that running the code above where an entry doesn't exist will create a default entry in the map.
Depending on your general needs you could also use a simple vector if generalized sorting isn't important, or, if you know that you'll always have a fixed number of characters, you could fill the vector with predefined data for each possible character.
It all depends on your memory requirements.

What is the best way to return an image or video file from a function using c++?

I am writing a C++ library that fetches and returns either image data or video data from a cloud server using libcurl. I've started writing some test code but am still stuck on designing the API, because I'm not sure about the best way to handle these media files. Storing the data in a char/string variable as binary data seems to work, but I wonder if that would take up too much RAM if the files are too big. I'm new to this, so please suggest a solution.
You can use something like zlib to compress the data in memory and then uncompress it only when it needs to be used; however, most modern computers have a lot of memory, so you can handle quite a few images before you need to start compressing. With videos, which are effectively a LOT of images, it becomes more important -- you tend to decompress as you go, and possibly even stream from disk as you go.
The usual way to handle this, from an API point of view, is to have something like an Image object and a Video object (classes). These objects would have functions to "get" the uncompressed image/frame. The "get" function would check to see if the data is currently compressed; if it is, it would decompress it before returning it; if it's not compressed, it can return it immediately. The way the data is actually stored (compressed/uncompressed/on disk/in memory) and the details of how to work with it are thus hidden behind the "get" function. Most importantly, this model lets you change your mind later, adding additional types of compression, adding disk-streaming support, etc., without changing how the code that calls the get() function is written.
The other challenge is how you return an Image or Video object from a function. You can do it like this:
Image getImageFromURL( const std::string &url );
But this has the interesting problem that the image may be "copied" during the return process (sometimes; it depends on how the compiler optimizes things, though C++11 move semantics and return-value optimization usually elide the copy). This way is more memory efficient:
void getImageFromURL( const std::string &url, Image &result );
This way, you pass in the image object into which you want your image loaded. No copies are made. You can also change the 'void' return value into some kind of error/status code, if you aren't using exceptions.
If you're worried about what to do, write code both for returning the data in an array and for writing the data to a file ... and pass the responsibility for choosing to the caller. Make your function something like
/* one of dst and outfile should be NULL */
/* if dst is not NULL, dstlen specifies the size of the array */
/* if outfile is not NULL, data is written to that file */
/* the return value indicates success (0) or the reason for failure */
int getdata(unsigned char *dst, size_t dstlen,
            const char *outfile,
            const char *resource);