boost::threads execution ordering

I have a problem with the order of execution of threads created consecutively.
Here is the code:
#include <iostream>
#include <Windows.h>
#include <boost/thread.hpp>
using namespace std;
boost::mutex mutexA;
boost::mutex mutexB;
boost::mutex mutexC;
boost::mutex mutexD;
void SomeWork(char letter, int index)
{
boost::mutex::scoped_lock lock;
switch(letter)
{
case 'A' : lock = boost::mutex::scoped_lock(mutexA); break;
case 'B' : lock = boost::mutex::scoped_lock(mutexB); break;
case 'C' : lock = boost::mutex::scoped_lock(mutexC); break;
case 'D' : lock = boost::mutex::scoped_lock(mutexD); break;
}
cout << letter <<index << " started" << endl;
Sleep(800);
cout << letter<<index << " finished" << endl;
}
int main(int argc , char * argv[])
{
for(int i = 0; i < 16; i++)
{
char x = rand() % 4 + 65;
boost::thread tha = boost::thread(SomeWork,x,i);
Sleep(10);
}
Sleep(6000);
system("PAUSE");
return 0;
}
Each time, a letter (from A to D) and a generation id (i) are passed to SomeWork, which runs in a new thread. I do not care about the execution order between letters, but for a particular letter, say A, Ax has to start before Ay if x < y.
A random part of one output of the code is:
B0 started
D1 started
C2 started
A3 started
B0 finished
B12 started
D1 finished
D15 started
C2 finished
C6 started
A3 finished
A9 started
B12 finished
B11 started --> B11 started after B12 finished.
D15 finished
D13 started
C6 finished
C7 started
A9 finished
How can I avoid such conditions?
Thanks.
I solved the problem using condition variables, but I changed the problem a bit. The solution is to keep track of the index of the for loop, so each thread knows when it must not run yet. As far as this code is concerned, there are two other things I would like to ask about.
First, on my computer, when I set the for-loop count to 350 I got an access violation; 310 iterations worked. So I realized that there is a maximum number of threads that can be created. How can I determine this number?
Second, in Visual Studio 2008, the release build of the code showed really strange behaviour: without the condition variable (the lines marked line 1 to line 3 commented out), the threads were still ordered. How could that happen?
Here is the code:
#include <iostream>
#include <Windows.h>
#include <boost/thread.hpp>
using namespace std;
boost::mutex mutexA;
boost::mutex mutexB;
boost::mutex mutexC;
boost::mutex mutexD;
class cl
{
public:
boost::condition_variable con;
boost::mutex mutex_cl;
char Letter;
int num;
cl(char letter) : Letter(letter) , num(0)
{
}
void doWork( int index, int tracknum)
{
boost::unique_lock<boost::mutex> lock(mutex_cl);
while(num != tracknum) // line 1
con.wait(lock); // line 2
Sleep(10);
num = index;
cout << Letter<<index << endl;
con.notify_all(); // line 3
}
};
int main(int argc , char * argv[])
{
cl A('A');
cl B('B');
cl C('C');
cl D('D');
for(int i = 0; i < 100; i++)
{
boost::thread(&cl::doWork,&A,i+1,i);
boost::thread(&cl::doWork,&B,i+1,i);
boost::thread(&cl::doWork,&C,i+1,i);
boost::thread(&cl::doWork,&D,i+1,i);
}
cout << "************************************************************************" << endl;
Sleep(6000);
system("PAUSE");
return 0;
}

If you have two different threads waiting for the lock, it's entirely non-deterministic which one will acquire it once the lock is released by the previous holder. I believe this is what you are experiencing. Assume B10 is holding the lock, and in the meantime threads are spawned for B11 and B12. When B10 releases the lock, it's down to a coin toss whether B11 or B12 acquires it next, irrespective of which thread was created first, or even which thread started waiting first.
Perhaps you should implement a work queue for each letter, such that you spawn exactly 4 threads, each of which consumes work units? This is the only easy way to guarantee this kind of ordering. A simple mutex is not going to guarantee ordering if multiple threads are waiting for the lock.
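As a rough illustration of the work-queue idea, here is a minimal sketch using the C++11 standard library rather than Boost. The class name OrderedWorker and the helper run_in_order are hypothetical, not from the question; the sketch assumes one such worker per letter, with jobs submitted in index order from the main thread.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: a single worker thread that runs submitted jobs
// strictly in the order in which they were submitted.
class OrderedWorker {
public:
    OrderedWorker() : worker_([this] { run(); }) {}

    ~OrderedWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();  // the worker drains any remaining jobs first
    }

    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // done_ is set and nothing is left
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so submit() is never blocked by work
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool done_ = false;
    std::thread worker_;  // declared last so the other members exist before it starts
};

// Submits indices 0..count-1 to one worker and returns the order they ran in.
std::vector<int> run_in_order(int count) {
    std::vector<int> order;
    {
        OrderedWorker worker;
        for (int i = 0; i < count; ++i)
            worker.submit([&order, i] { order.push_back(i); });
    }  // the destructor joins the worker, so reading `order` afterwards is safe
    assert(order.size() == static_cast<std::size_t>(count));
    return order;
}
```

For the question's program, one would create four such workers (A to D) and call submit on the worker matching the random letter; jobs for the same letter then run in submission order, while different letters still run concurrently.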

Even though B11 is started before B12 it is not guaranteed to be given a CPU time slice to execute SomeWork() prior to B12. This decision is up to the OS and its scheduler.
Mutexes are typically used to synchronize access to data between threads, and the concern raised here is with the sequence of thread execution (i.e. the order of data access).
If the threads for group 'A' are executing the same code on the same data then just use one thread. This will eliminate context switching between threads in the group and yield the same result. If the data is changing, consider a producer/consumer pattern. Paul Bridger gives an easy-to-understand producer/consumer example.

Your threads have dependencies that must be satisfied before they start execution. In your example, B12 depends on B0 and B11. Somehow you have to track that dependency knowledge. Threads with unfinished dependencies must be made to wait.
I would look into condition variables. Each time a thread finishes SomeWork() it would use the condition variable's notify_all() method. Then all of the waiting threads must check if they still have dependencies. If so, go back and wait. Otherwise, go ahead and call SomeWork().
You need some way for each thread to determine if it has unfinished dependencies. This will probably be some globally available entity. You should only modify it while holding the mutex (in SomeWork()). Reads that can overlap with a write must hold the mutex as well; reading from multiple threads without it is only safe while no thread is writing.
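A minimal sketch of this wait-and-recheck scheme, using std::condition_variable from C++11 instead of Boost. The names TicketGate and run_tickets are hypothetical, and the sketch assumes each thread is handed a per-letter sequence number (a "ticket") when it is created; the dependency state is a single integer holding the next ticket allowed to run.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical gate: the thread holding ticket n may proceed only after
// the threads holding tickets 0..n-1 have finished their work.
class TicketGate {
public:
    void run_in_turn(int ticket, std::vector<int>& log) {
        std::unique_lock<std::mutex> lock(mutex_);
        // Re-check the dependency every time some thread finishes.
        cv_.wait(lock, [&] { return next_ == ticket; });
        log.push_back(ticket);  // stands in for SomeWork(), done under the mutex
        ++next_;
        cv_.notify_all();       // every waiter re-evaluates its own condition
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int next_ = 0;  // shared dependency state, only touched under mutex_
};

// Starts `count` threads in reverse ticket order and returns the run order.
std::vector<int> run_tickets(int count) {
    TicketGate gate;
    std::vector<int> log;
    std::vector<std::thread> threads;
    for (int t = count - 1; t >= 0; --t)
        threads.emplace_back([&gate, &log, t] { gate.run_in_turn(t, log); });
    for (auto& th : threads) th.join();
    assert(log.size() == static_cast<std::size_t>(count));
    return log;
}
```

Even though the threads are created in reverse order here, the log comes out in ticket order. One gate per letter would keep the letters independent of each other.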

Related

Multi-thread crawler doesn't speed up with threading (on local files)

I have a task: to write a multithreaded web crawler (actually I have a local set of .html files that I need to parse and move to another directory). The main requirement for this task is to make it possible to enter an arbitrary number of threads and to determine at what number the program stops gaining performance.
#include <iostream>
#include <fstream>
#include <thread>
#include <mutex>
#include <queue>
#include <ctime>
#include <set>
#include <chrono>
#include <atomic>
using namespace std;
class WebCrawler{
private:
const string start_path = "/";
const string end_path = "/";
int thread_counts;
string home_page;
queue<string> to_visit;
set<string> visited;
vector<thread> threads;
mutex mt1;
int count;
public:
WebCrawler(int thread_counts_, string root_)
:thread_counts(thread_counts_), home_page(root_) {
to_visit.push(root_);
visited.insert(root_);
count = 0;
}
void crawler(){
for(int i = 0; i<thread_counts; i++)
threads.push_back(thread(&WebCrawler::start_crawl, this));
for(auto &th: threads)
th.join();
cout<<"Count: "<<count<<endl;
}
void parse_html(string page_){
cout<<"Thread: "<<this_thread::get_id()<<" page: "<<page_<< endl;
ifstream page;
page.open(start_path+page_, ios::in);
string tmp;
getline(page, tmp);
page.close();
for(int i = 0; i<tmp.size(); i++){
if( tmp[i] == '<'){
string tmp_num ="";
while(tmp[i]!= '>'){
if(isdigit(tmp[i]))
tmp_num+=tmp[i];
i++;
}
tmp_num+= ".html";
if((visited.find(tmp_num) == visited.end())){
mt1.lock();
to_visit.push(tmp_num);
visited.insert(tmp_num);
mt1.unlock();
}
}
}
}
void move(string page_){
mt1.lock();
count++;
ofstream page;
page.open(end_path+page_, ios::out);
page.close();
mt1.unlock();
}
void start_crawl(){
cout<<"Thread started: "<<this_thread::get_id()<< endl;
string page;
while(!to_visit.empty()){
mt1.lock();
page = to_visit.front();
to_visit.pop();
mt1.unlock();
parse_html(page);
move(page);
}
}
};
int main(int argc, char const *argv[])
{
int start_time = clock();
WebCrawler crawler(7, "0.html");
crawler.crawler();
int end_time = clock();
cout<<"Time: "<<(float)(end_time -start_time)/CLOCKS_PER_SEC<<endl;
cout<<thread::hardware_concurrency()<<endl;
return 0;
}
1 thread = Time: 0.709504
2 thread = Time: 0.668037
4 thread = Time: 0.762967
7 thread = Time: 0.781821
I've been trying to figure out for a week why my program is running slower even on two threads. I probably don't fully understand how mutex works, or perhaps the speed is lost during the joining of threads. Do you have any ideas how to fix it?

There are many ways to protect shared state in multithreaded code, implicit or explicit.
The code here is totally untested, and there are also some implicit assumptions that must be considered, for example that int is large enough for your task.
Let's make a short analysis of what needs protection.
Variables that are accessed from multiple threads need protection. Things that are const can be excluded, unless you const_cast them or parts of them are mutable.
Global objects like files or cout need protection because they could be overwritten or written from multiple threads. Streams have their own internal locks, so you can write to cout from multiple threads (the output may interleave), but you don't want that for the files in this case: if multiple threads want to open the same file, you will get an error.
std::endl forces a flush, so change it to "\n" as a commenter noted.
So this boils down to:
queue<string> to_visit;
set<string> visited; // should be renamed visiting
int count;
<streams and files>
count is easy
std::atomic<int> count;
The files are implicitly protected by your visited/visiting check, so they are fine too, and the mutex in move can be removed.
The remaining two each need a mutex, as they could be updated independently.
mutex mutTovisit, // formerly known as mt1
mutVisiting;
Now we have the problem that with two mutexes we could deadlock if we lock them in a different order in two places. You need to read up on locking if you add more locks; scoped_lock and lock are good places to start.
Changing the code to
{
scoped_lock visitLock(mutVisiting); // unlocks at end of } even if stuff throws
if((visited.find(tmp_num) == visited.end())){
scoped_lock toLock(mutTovisit);
to_visit.push(tmp_num);
visited.insert(tmp_num);
}
}
This code also contains multiple errors that are hidden by the non-thread-safe access to to_visit and the randomness of the thread starts.
while(!to_visit.empty()){ // 2. now the next thread starts and sees its empty and stops
// 3. or worse it starts then hang at lock
mt1.lock();
page = to_visit.front(); // 4. and does things that are not good here with an empty to_visit
to_visit.pop(); // 1. is now empty after reading root
mt1.unlock();
parse_html(page);
move(page);
}
To solve this you need a counter (atomic?), found, of the currently known unvisited pages, so we know when we are done. Then, to start threads when there is new work to be done, we can use std::condition_variable (or std::condition_variable_any).
The general idea is to have the threads wait until work is available, and each time a new page is discovered to call notify_one to start work.
To start up, set found to 1 and call notify_one once the threads have started. When a thread is done with a page, decrease found. To stop, the thread that decreases found to zero calls notify_all so that all threads can stop.
What you will find is that if the data is on a single slow disk, you are unlikely to see much effect from more than 2 threads; if all files are currently cached in RAM, you might see more.
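The counter-plus-condition-variable plan could be sketched roughly as follows. This is not the asker's crawler: the function crawl and the page graph (page n "links" to pages 2n+1 and 2n+2) are hypothetical stand-ins for parsing files, and the counter is a plain int guarded by the mutex rather than an atomic, which keeps the wait condition simple.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <set>
#include <thread>
#include <vector>

// Visits a hypothetical page graph in which page n links to 2n+1 and 2n+2
// (only pages below `limit` exist). Returns the number of pages visited.
int crawl(int thread_count, int limit) {
    std::mutex m;
    std::condition_variable cv;
    std::queue<int> to_visit;
    std::set<int> seen;
    int pending = 1;  // pages that are queued or currently being processed
    int visited = 0;

    to_visit.push(0);
    seen.insert(0);

    auto worker = [&] {
        for (;;) {
            int page;
            {
                std::unique_lock<std::mutex> lock(m);
                // Sleep until there is work, or until all work is finished.
                cv.wait(lock, [&] { return !to_visit.empty() || pending == 0; });
                if (to_visit.empty()) return;  // pending == 0: nothing will arrive
                page = to_visit.front();
                to_visit.pop();
            }
            // "Parsing" happens outside the lock so other threads can proceed.
            int found[2] = {2 * page + 1, 2 * page + 2};
            {
                std::lock_guard<std::mutex> lock(m);
                for (int next : found) {
                    if (next < limit && seen.insert(next).second) {
                        to_visit.push(next);
                        ++pending;
                        cv.notify_one();  // new work: wake one sleeping thread
                    }
                }
                ++visited;
                if (--pending == 0) cv.notify_all();  // last page: release everyone
            }
        }
    };

    std::vector<std::thread> threads;
    for (int i = 0; i < thread_count; ++i) threads.emplace_back(worker);
    for (auto& th : threads) th.join();
    assert(pending == 0);
    return visited;
}
```

Because pending counts pages in flight as well as queued pages, a thread that finds the queue momentarily empty goes back to sleep instead of exiting while another thread may still discover new pages.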

I think there's a bottleneck in your move function. Each thread takes the same amount of time to go through that function, and it holds the mutex for the whole call. You could start with that.

OpenMP integer copied after tasks finish

I do not know if this is documented anywhere, if so I would love a reference to it, however I have found some unexpected behaviour when using OpenMP. I have a simple program below to illustrate the issue. Here in point form I will tell what I expect the program to do:
I want to have 2 threads
They both share an integer
The first thread increments the integer
The second thread reads the integer
After incrementing once, an external process must tell the first thread to continue incrementing (via a mutex lock)
The second thread is in charge of unlocking this mutex
As you will see, the counter which is shared between the threads is not updated properly for the second thread. However, if I turn the counter into an integer reference instead, I get the expected result. Here is a simple code example:
#include <mutex>
#include <thread>
#include <chrono>
#include <iostream>
#include <omp.h>
using namespace std;
using std::this_thread::sleep_for;
using std::chrono::milliseconds;
const int sleep_amount = 2000;
int main() {
int counter = 0; // if I comment this and uncomment the 2 lines below, I get the expected results
/* int c = 0; */
/* int &counter = c; */
omp_lock_t mut;
omp_init_lock(&mut);
int counter_1, counter_2;
#pragma omp parallel
#pragma omp single
{
#pragma omp task default(shared)
// The first task just increments the counter 3 times
{
while (counter < 3) {
omp_set_lock(&mut);
counter += 1;
cout << "increasing: " << counter << endl;
}
}
#pragma omp task default(shared)
{
sleep_for(milliseconds(sleep_amount));
// While sleeping, counter is increased to 1 in the first task
counter_1 = counter;
cout << "counter_1: " << counter << endl;
omp_unset_lock(&mut);
sleep_for(milliseconds(sleep_amount));
// While sleeping, counter is increased to 2 in the first task
counter_2 = counter;
cout << "counter_2: " << counter << endl;
omp_unset_lock(&mut);
// Release one last time to increment the counter to 3
}
}
omp_destroy_lock(&mut);
cout << "expected: 1, actual: " << counter_1 << endl;
cout << "expected: 2, actual: " << counter_2 << endl;
cout << "expected: 3, actual: " << counter << endl;
}
Here is my output:
increasing: 1
counter_1: 0
increasing: 2
counter_2: 0
increasing: 3
expected: 1, actual: 0
expected: 2, actual: 0
expected: 3, actual: 3
gcc version: 9.4.0
Additional discoveries:
If I use OpenMP 'sections' instead of 'tasks', I get the expected result as well. The problem seems to be with 'tasks' specifically
If I use posix semaphores, this problem also persists.

It is not permitted to unlock a mutex from another thread; doing so causes undefined behavior. The general solution in this case is to use semaphores. Wait conditions (condition variables) can also help, depending on the real-world use case. To quote the OpenMP documentation (note that this constraint is shared by nearly all mutex implementations, including pthreads):
A program that accesses a lock that is not in the locked state or that is not owned by the task that contains the call through either routine is non-conforming.
A program that accesses a lock that is not in the uninitialized state through either routine is non-conforming.
Moreover, the two tasks can be executed on the same thread or on different threads. You should not assume anything about their scheduling unless you tell OpenMP to do so with dependencies. Here, it is completely compliant for a runtime to execute the tasks serially. You need to use OpenMP sections so that multiple threads execute different sections. Besides, it is generally considered bad practice to use locks in tasks, as the runtime scheduler is not aware of them.
Finally, you do not need a lock in this case: an atomic operation is sufficient. Fortunately, OpenMP supports atomic operations (as well as C++).
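To illustrate the atomic approach, here is a rough sketch that swaps OpenMP tasks for two std::thread objects (so the two sides are guaranteed to run on separate threads) and replaces the lock with an atomic permit counter. The function run_handshake and the variable permits are hypothetical names; the busy-wait with yield is only there to keep the example short.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <thread>

// Returns {counter_1, counter_2, final counter}; the intended values are 1, 2, 3.
std::array<int, 3> run_handshake() {
    std::atomic<int> counter{0};
    std::atomic<int> permits{1};  // the first increment is allowed immediately
    int counter_1 = 0;
    int counter_2 = 0;

    // Increments three times, consuming one permit per increment.
    std::thread incrementer([&] {
        for (int i = 0; i < 3; ++i) {
            while (permits.load() == 0) std::this_thread::yield();
            permits.fetch_sub(1);
            counter.fetch_add(1);
        }
    });

    // Reads the counter after each increment, then grants the next permit.
    std::thread reader([&] {
        while (counter.load() < 1) std::this_thread::yield();
        counter_1 = counter.load();
        permits.fetch_add(1);
        while (counter.load() < 2) std::this_thread::yield();
        counter_2 = counter.load();
        permits.fetch_add(1);
    });

    incrementer.join();
    reader.join();
    assert(counter.load() == 3);
    return {counter_1, counter_2, counter.load()};
}
```

Unlike a lock, an atomic counter has no owner, so it is fine for one thread to decrement it and another to increment it.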
Additional notes
Note that locks guarantee the consistency of memory accesses across multiple threads thanks to memory barriers. An unlock operation on a mutex causes a release memory barrier that makes writes visible to other threads. A lock from another thread performs an acquire memory barrier that forces reads to be done after the lock. When locks and unlocks are not used correctly, memory accesses are no longer safe, which can cause, for example, some variables not to be updated as seen from other threads. More generally, this also tends to create race conditions. In short, don't do that.

C++ Producer Consumer Problem with condition variable + mutex + pthreads

I need to solve the producer-consumer problem in C++, for 1 consumer and 6 producers and for 1 producer and 6 consumers. Below is the statement of the question.
Question 1:
Imagine that you are waiting for some friends in a very busy restaurant and you are watching the staff, who wait on tables, bring food from the kitchen to their tables. This is an example of the classic "Producer-Consumer" problem. There is a limit on servers and meals are constantly produced by the kitchen. Consider then that there is a limit on servers (consumers) and an "unlimited" supply of meals being produced by chefs (producers).
One approach to facilitate identification and thus reduce this to a "producer-consumer" problem is to limit the number of consumers and thus limit the infinite number of meals produced in the kitchen. Thus, a semaphore is suggested to control the production order of the meals that will be taken by the attendants.
The procedure would be something like:
Create a semaphore;
Create the server and chef threads;
Produce as many meals as you can and keep a record of how many meals are in the queue;
The server thread will run until it manages to deliver all the meals produced to the tables; and
Threads must be "joined" with the main thread.
Also consider that there are 6 chefs and 1 attendant. If you want, you can consider that a chef takes 50 microseconds to produce a meal and the server takes 10 microseconds to deliver the meal to the table. Set a maximum number of customers to serve. Print on the screen, at the end of the execution, which chef is most and least idle and how many meals each chef has produced.
Question 2:
Considering the restaurant described above. Now consider that there are 1 chef and 6 attendants. Assume that a chef takes 50 microseconds to produce a meal and the server takes 15 microseconds to deliver the meal to the table. Set a maximum number of customers to serve.
Print which server is the most and least idle and how many meals each server has delivered.
I managed to solve it for 6 producers and 1 consumer, but for 6 consumers and 1 producer it's not working; the program seems to get stuck in a deadlock. I'd be grateful if anyone knows how to help.
#include <iostream>
#include <random>
#include <chrono>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <deque>
#include <vector>
//The mutex class is a synchronization primitive that can be used to protect shared data
//from being simultaneously accessed by multiple threads.
std::mutex semaforo; //mutex ("semaphore") that controls access to the buffer
std::condition_variable notifica; //condition variable used to signal that a dish was produced/consumed
std::deque<int> buffer; //buffer of integers
const unsigned int capacidade_buffer = 10; //buffer size
const unsigned int numero_pratos = 25; //number of dishes to be produced
void produtor()
{
unsigned static int contador_pratos_produzidos = 0;
while (contador_pratos_produzidos < numero_pratos)
{
std::unique_lock<std::mutex> locker(semaforo);
notifica.wait(locker, []
{ return buffer.size() < capacidade_buffer; });
std::this_thread::sleep_for(std::chrono::microseconds(50));
buffer.push_back(contador_pratos_produzidos);
if (contador_pratos_produzidos < numero_pratos)
{
contador_pratos_produzidos++;
}
locker.unlock();
notifica.notify_all();
}
}
void consumidor(int ID, std::vector<int> &consumido)
{
unsigned static int contador_pratos_consumidos = 0;
while (contador_pratos_consumidos < numero_pratos)
{
std::unique_lock<std::mutex> locker(semaforo);
notifica.wait(locker, []
{ return buffer.size() > 0; });
std::this_thread::sleep_for(std::chrono::microseconds(15));
buffer.pop_front();
if (contador_pratos_consumidos < numero_pratos)
{
contador_pratos_consumidos++;
consumido[ID]++;
}
locker.unlock();
notifica.notify_one();
}
}
int main()
{
//vector counting how many dishes each waiter delivered
std::vector<int> consumido(6, 0);
//vector of waiter (consumer) threads
std::vector<std::thread> consumidores;
for (int k = 0; k < 6; k++)
{
consumidores.push_back(std::thread(consumidor, k, std::ref(consumido)));
}
//producer/chef
std::thread p1(produtor);
for (auto &k : consumidores)
{
k.join();
}
p1.join();
int mais_ocioso = 200, menos_ocioso = 0, mais, menos;
for (int k = 0; k < 6; k++)
{
std::cout << "Garcon " << k + 1 << " entregou " << consumido[k] << " pratos\n";
if (consumido[k] > menos_ocioso)
{
menos = k + 1;
menos_ocioso = consumido[k];
}
if (consumido[k] < mais_ocioso)
{
mais = k + 1;
mais_ocioso = consumido[k];
}
}
std::cout << "\nO mais ocioso foi o garcon " << mais << " e o menos ocioso foi o garcon " << menos << "\n";
}

The same exact bug exists in both the consumer and the producer function. I'll explain one of them, and the same bug must also be fixed in the other one.
unsigned static int contador_pratos_consumidos = 0;
while (contador_pratos_consumidos < numero_pratos)
{
This static counter gets accessed and modified by multiple execution threads.
Any non-atomic object that's used by multiple execution threads must be properly sequenced (accessed only when holding an appropriate mutex).
If you focus your attention on the above two lines it should be obvious that this counter is accessed without the protection of any mutex. Once you realize that, the bug is obvious: at some point contador_pratos_consumidos will be exactly one less than numero_pratos. When that happens you can have multiple execution threads evaluating the while condition, at the same time, and all of them will happily conclude that it's true.
Multiple execution threads then enter the while loop. One will succeed in acquiring the mutex and consuming the "product", and finish. The remaining execution threads will wait forever, for another "product" that will never arrive. No more products will ever be produced. No soup for them.
The same bug also exists in the producer, except that the effects of the bug will be rather subtle: more products will end up being produced than there should be.
Of course, pedantically all of this is undefined behavior, so anything can really happen, but these are the typical, usual consequences of this kind of undefined behavior. Both bugs must be fixed in order for this algorithm to work correctly.
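One possible shape of the fix, as a sketch rather than a drop-in replacement for the code in the question: the counters are read and written only while holding the mutex, and the consumers' wait condition also becomes true once every dish has been served, so idle consumers wake up and exit. The function serve and its parameters are hypothetical, and the simulated cooking and delivery delays are left out.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

// One producer, `consumers` consumers. Returns how many dishes each consumer served.
std::vector<int> serve(int consumers, int total_dishes, std::size_t capacity) {
    std::mutex m;
    std::condition_variable cv;
    std::deque<int> buffer;
    int produced = 0;  // both counters are only touched while holding m
    int consumed = 0;
    std::vector<int> per_consumer(consumers, 0);

    std::thread producer([&] {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            if (produced == total_dishes) return;  // checked under the mutex
            cv.wait(lock, [&] { return buffer.size() < capacity; });
            buffer.push_back(produced++);
            lock.unlock();
            cv.notify_all();
        }
    });

    auto consumer = [&](int id) {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            // Wake up for a ready dish, or because no dish will ever arrive again.
            cv.wait(lock, [&] { return !buffer.empty() || consumed == total_dishes; });
            if (buffer.empty()) return;  // everything has been served
            buffer.pop_front();
            ++consumed;
            ++per_consumer[id];
            lock.unlock();
            cv.notify_all();  // frees a slot for the producer, releases idle consumers
        }
    };

    std::vector<std::thread> threads;
    for (int k = 0; k < consumers; ++k) threads.emplace_back(consumer, k);
    producer.join();
    for (auto& th : threads) th.join();
    assert(std::accumulate(per_consumer.begin(), per_consumer.end(), 0) == total_dishes);
    return per_consumer;
}
```

Calling serve(6, 25, 10) mirrors the numbers in the question: six waiters, 25 dishes, a buffer of 10. The per-waiter split varies from run to run, but the total is always 25 and every thread terminates.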

Progress bar in Windows activity field?

I've written a C++ program that performs time-consuming calculations, and I want the user to be able to see the progress while the program is running in the background (minimized).
I'd like to use the same effect that Chrome uses when downloading a file.
How do I access this feature? Can I use it in my C++ program?

If the time consuming operation can be performed inside a loop, and depending on whether or not it is a count controlled loop, you may be able to use thread and atomic to solve your problem.
If your processor architecture supports multithreading, you can use threads to run calculations concurrently. The basic use of a thread is to run a function in parallel with the main thread. These operations may effectively be done at the same time, meaning you would be able to use the main thread to check the progress of your time-consuming calculations. With parallel threads comes the problem of data races: if two threads try to access or edit the same data, they could do so incorrectly and corrupt the memory. This can be solved with atomic. You could use an atomic_int to make sure two actions never cause a data race.
A viable example:
#include <thread>
#include <mutex>
#include <atomic>
#include <iostream>
//function prototypes
void foo(std::mutex * mtx, std::atomic_int * i);
//main function
int main() {
//first define your variables
std::thread bar;
std::mutex mtx;
std::atomic_int value;
//store initial value just in case
value.store(0);
//create the thread and assign it a task by passing a function and any parameters of the function as parameters of thread
std::thread functionalThread;
functionalThread = std::thread(foo/*function name*/, &mtx, &value/*parameters of the function*/);
//a loop to keep checking value to see if it has reached its final value
//temp variable to hold value so that operations can be performed on it while the main thread does other things
int temp = value.load();
//double to hold percent value
double percent;
while (temp < 1000000000) {
//calculate percent value
percent = 100.0 * double(temp) / 1000000000.0;
//display percent value
std::cout << "The current percent is: " << percent << "%" << std::endl;
//get new value for temp
temp = value.load();
}
//display message when calculations complete
std::cout << "Task is done." << std::endl;
//when you join a thread you are essentially waiting for the thread to finish before the calling thread continues
functionalThread.join();
//cin to hold program from completing to view results
int wait;
std::cin >> wait;
//end program
return 0;
}
void foo(std::mutex * mtx, std::atomic_int * i) {
//function counts to 1,000,000,000 as fast as it can
for (i->store(0); i->load() < 1000000000; i->store(i->load() + 1)) {
//keep i counting
//the first part is the initial value, store() sets the value of the atomic int
//the second part is the exit condition, load() returns the currently stored value of the atomic
//the third part is the increment
}
}

What is the fastest way to notify another thread that data is available? Any alternatives to spinning?

One of my threads writes data to a circular buffer and another thread needs to process this data ASAP. I was thinking of writing a simple spin like this (pseudo-code):
while (true) {
while (!a[i]) {
/* do nothing - just keep checking over and over */
}
// process b[i]
i++;
if (i >= MAX_LENGTH) {
i = 0;
}
}
Above I'm using a to indicate that the data stored in b is available for processing. Probably I should also set thread affinity for such a "hot" thread. Of course such a spin is very expensive in terms of CPU, but that's OK for me as my primary requirement is latency.
The question is: should I really write something like that, or do boost or the STL offer something that:
Easier to use.
Has roughly the same (or even better?) latency at the same time occupying less CPU resources?
I think that my pattern is so general that there should be some good implementation somewhere.
upd It seems my question is still too complicated. Let's just consider the case where I need to write some items to an array in arbitrary order and another thread should read them in the right order as items become available. How do I do that?
upd2
I'm adding a test program to demonstrate what I want to achieve and how. At least on my machine it happens to work. I'm using rand to show that I cannot use a general queue and need an array-based structure:
#include "stdafx.h"
#include <string>
#include <boost/thread.hpp>
#include "windows.h" // for Sleep
const int BUFFER_LENGTH = 10;
int buffer[BUFFER_LENGTH];
short flags[BUFFER_LENGTH];
void ProcessorThread() {
for (int i = 0; i < BUFFER_LENGTH; i++) {
while (flags[i] == 0);
printf("item %i received, value = %i\n", i, buffer[i]);
}
}
int _tmain(int argc, _TCHAR* argv[])
{
memset(flags, 0, sizeof(flags));
boost::thread processor = boost::thread(&ProcessorThread);
for (int i = 0; i < BUFFER_LENGTH * 10; i++) {
int x = rand() % BUFFER_LENGTH;
buffer[x] = x;
flags[x] = 1;
Sleep(100);
}
processor.join();
return 0;
}
Output:
item 0 received, value = 0
item 1 received, value = 1
item 2 received, value = 2
item 3 received, value = 3
item 4 received, value = 4
item 5 received, value = 5
item 6 received, value = 6
item 7 received, value = 7
item 8 received, value = 8
item 9 received, value = 9
Is my program guaranteed to work? How would you redesign it, perhaps using some existing structures from boost/stl instead of an array? Is it possible to get rid of the "spin" without affecting latency?

If the consuming thread is put to sleep, it takes a few microseconds for it to wake up. This is the process scheduler latency, which you cannot avoid unless the thread is busy-spinning as yours does. The thread also needs to be real-time FIFO so that it is never put to sleep when it is ready to run but has exhausted its time quantum.
So, there is no alternative that could match latency of busy spinning.
(Surprising you are using Windows, it is best avoided if you are serious about HFT).
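If busy-spinning is kept, the flags should at least be atomic: under the C++11 memory model, plain short flags written by one thread and read by another form a data race, so the test program above is not guaranteed to work. A minimal sketch of the same array-based hand-off with std::atomic flags follows; the function spin_transfer and the stored values are hypothetical, and the producer fills the slots in reverse order to stand in for the random order in the question.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr int kBufferLength = 10;

// The producer fills slots in reverse order; the consumer spins on each slot's
// flag and reads the slots in index order. Returns the values as received.
std::vector<int> spin_transfer() {
    std::array<int, kBufferLength> buffer{};
    std::array<std::atomic<bool>, kBufferLength> ready;
    for (auto& flag : ready) flag.store(false);

    std::vector<int> received;
    std::thread consumer([&] {
        for (int i = 0; i < kBufferLength; ++i) {
            // The acquire load pairs with the release store below, so the write
            // to buffer[i] is visible once the flag is observed as true.
            while (!ready[i].load(std::memory_order_acquire)) { /* spin */ }
            received.push_back(buffer[i]);
        }
    });

    for (int i = kBufferLength - 1; i >= 0; --i) {
        buffer[i] = i * 10;                               // write the data first
        ready[i].store(true, std::memory_order_release);  // then publish it
    }

    consumer.join();
    assert(received.size() == static_cast<std::size_t>(kBufferLength));
    return received;
}
```

The latency characteristics are the same as in the original spin; only the visibility of the data is now guaranteed.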

This is what Condition Variables were designed for. std::condition_variable is defined in the C++11 standard library.
What exactly is fastest for your purposes depends on your problem; You can attack it from several angles, but CVs (or derivative implementations) are a good starting point for understanding the subject better and approaching an implementation.
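For comparison, here is a rough sketch of the same array hand-off built on std::condition_variable. It trades some wake-up latency for an idle CPU while waiting. The function cv_transfer and the stored values are hypothetical; as before, the producer writes the slots in reverse order and the consumer reads them in index order.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Same hand-off as a spinning reader, but the consumer sleeps while it waits.
std::vector<int> cv_transfer() {
    const int n = 10;
    std::mutex m;
    std::condition_variable cv;
    std::vector<int> buffer(n, 0);
    std::vector<char> ready(n, 0);  // 1 once the matching slot has been written
    std::vector<int> received;

    std::thread consumer([&] {
        for (int i = 0; i < n; ++i) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return ready[i] != 0; });  // sleeps until slot i is set
            received.push_back(buffer[i]);
        }
    });

    for (int i = n - 1; i >= 0; --i) {
        {
            std::lock_guard<std::mutex> lock(m);
            buffer[i] = i * 10;
            ready[i] = 1;
        }
        cv.notify_one();  // only one consumer waits, so one wake-up is enough
    }

    consumer.join();
    assert(received.size() == static_cast<std::size_t>(n));
    return received;
}
```

The predicate form of wait handles spurious wake-ups and the case where the slot was already written before the consumer started waiting.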

Consider using the C++11 library if your compiler supports it, or the Boost equivalent if not. In your case, look especially at std::future with std::promise.
There is a good book about threading and C++11 threading library:
Anthony Williams. C++ Concurrency in Action (2012)
Example from cppreference.com:
#include <iostream>
#include <future>
#include <thread>
int main()
{
// future from a packaged_task
std::packaged_task<int()> task([](){ return 7; }); // wrap the function
std::future<int> f1 = task.get_future(); // get a future
std::thread(std::move(task)).detach(); // launch on a thread
// future from an async()
std::future<int> f2 = std::async(std::launch::async, [](){ return 8; });
// future from a promise
std::promise<int> p;
std::future<int> f3 = p.get_future();
std::thread( [](std::promise<int>& p){ p.set_value(9); },
std::ref(p) ).detach();
std::cout << "Waiting..." << std::flush;
f1.wait();
f2.wait();
f3.wait();
std::cout << "Done!\nResults are: "
<< f1.get() << ' ' << f2.get() << ' ' << f3.get() << '\n';
}

If you want a fast method then simply drop to making OS calls. Any C++ library wrapping them is going to be slower.
e.g. On Windows your consumer can call WaitForSingleObject(), and your data-producing thread can wake the consumer using SetEvent(). http://msdn.microsoft.com/en-us/library/windows/desktop/ms687032(v=vs.85).aspx
For Unix, here is a similar question with answers: Windows Event implementation in Linux using conditional variables?

Do you really need threading?
A single threaded app is trivially simple and eliminates all the issues with thread safety and the overhead of launching threads. I did a study of threaded vs non threaded code to append text to a log file. The non threaded code was better in every measure of performance.
