Parallel program no speed increase vs linear program - c++

I have created a model of a more complex program that will use multithreading and multiple hard drives to increase performance. The data is so large that reading it all into memory is not feasible, so it is read, processed, and written back out in chunks. This test program uses a pipeline design so that reading, processing, and writing can happen at the same time on three different threads. Because reads and writes go to different hard drives, there is no problem reading and writing simultaneously. However, the multithreaded version seems to run 2x slower than its linear version (also in the code). I have tried keeping the read and write threads alive instead of destroying them after each chunk, but the synchronization seemed to slow things down even more than the current version. Am I doing something wrong, or how can I improve this? Thank you.
Tested using an i3-2100 @ 3.1GHz and 16GB of RAM.
#include <iostream>
#include <fstream>
#include <ctime>
#include <cstdlib>
#include <thread>

#define CHUNKSIZE 8192    //size of each chunk to process
#define DATASIZE 2097152  //total size of data

using namespace std;

int data[3][CHUNKSIZE];
int run = 0;
int totalRun = DATASIZE/CHUNKSIZE;
bool finishRead = false, finishWrite = false;
ifstream infile;
ofstream outfile;
clock_t starttime, endtime;

/*
Process a chunk of data (simulation only; it is not required to sort all the data)
*/
void quickSort(int arr[], int left, int right) {
    int i = left, j = right;
    int tmp;
    int pivot = arr[(left + right) / 2];
    while (i <= j) {
        while (arr[i] < pivot) i++;
        while (arr[j] > pivot) j--;
        if (i <= j) {
            tmp = arr[i];
            arr[i] = arr[j];
            arr[j] = tmp;
            i++;
            j--;
        }
    }
    if (left < j) quickSort(arr, left, j);
    if (i < right) quickSort(arr, i, right);
}

/*
Find runtime
*/
void diffclock(){
    double diff = (endtime - starttime)/(CLOCKS_PER_SEC/1000);
    cout<<"Total run time: "<<diff<<"ms"<<endl;
}

/*
Read a chunk of data
*/
void readData(){
    for(int i = 0; i < CHUNKSIZE; i++){
        infile>>data[run%3][i];
    }
    finishRead = true;
}

/*
Write a chunk of data
*/
void writeData(){
    for(int i = 0; i < CHUNKSIZE; i++){
        outfile<<data[(run-2)%3][i]<<endl;
    }
    finishWrite = true;
}

/*
Pipelines Read, Process, Write using multithreading
*/
void threadtransfer(){
    starttime = clock();
    infile.open("/home/pcg/test/iothread/source.txt");
    outfile.open("/media/pcg/Data/test/iothread/ThreadDuplicate.txt");
    thread read, write;
    run = 0;
    readData();
    run = 1;
    readData();
    quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
    run = 2;
    while(run < totalRun){
        //cout<<run<<endl;
        finishRead = finishWrite = false;
        read = thread(readData);
        write = thread(writeData);
        read.detach();
        write.detach();
        quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
        while(!finishRead||!finishWrite){} //check if next cycle is ready.
        run++;
    }
    quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
    writeData();
    run++;
    writeData();
    infile.close();
    outfile.close();
    endtime = clock();
    diffclock();
}

/*
Linearly read, sort, and write a chunk and repeat.
*/
void lineartransfer(){
    int totalRun = DATASIZE/CHUNKSIZE;
    int holder[CHUNKSIZE];
    starttime = clock();
    infile.open("/home/pcg/test/iothread/source.txt");
    outfile.open("/media/pcg/Data/test/iothread/Linearduplicate.txt");
    run = 0;
    while(run < totalRun){
        for(int i = 0; i < CHUNKSIZE; i++) infile>>holder[i];
        quickSort(holder, 0, CHUNKSIZE - 1);
        for(int i = 0; i < CHUNKSIZE; i++) outfile<<holder[i]<<endl;
        run++;
    }
    endtime = clock();
    diffclock();
}

/*
Create a large amount of data for testing
*/
void createData(){
    outfile.open("/home/pcg/test/iothread/source.txt");
    for(int i = 0; i < DATASIZE; i++){
        outfile<<rand()<<endl;
    }
    outfile.close();
}

int main(){
    int mode=0;
    cout<<"Number of threads: "<<thread::hardware_concurrency()<<endl;
    cout<<"Enter mode\n1.Create Data\n2.thread copy\n3.linear copy\ninput mode:";
    cin>>mode;
    if(mode == 1) createData();
    else if(mode == 2) threadtransfer();
    else if(mode == 3) lineartransfer();
    return 0;
}

Don't busy-wait. This wastes precious CPU time and may well slow down the rest (not to mention that the compiler is allowed to optimize the loop into an infinite one, because it cannot tell whether those flags will ever change; the unsynchronized access to the flags is a data race, so it's not even correct in the first place). And don't detach() either. Replace both detach() and the busy-waiting with join():
while (run < totalRun) {
    read = thread(readData);
    write = thread(writeData);
    quickSort(data[(run-1)%3], 0, CHUNKSIZE - 1);
    read.join();
    write.join();
    run++;
}
As to the overall design: ignoring the global variables, I guess it's otherwise acceptable if you don't expect the processing (quickSort) part to ever exceed the read/write time. I, for one, would use message queues to pass the buffers between the various threads (which allows you to add more processing threads if you need them, either doing the same task in parallel or different tasks in sequence), but maybe that's because I'm used to doing it that way.
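For illustration, here is a minimal sketch of that message-queue idea: a tiny blocking queue that pipeline stages can use to hand buffers (or buffer indices) to each other from long-lived threads, so no threads are created or destroyed per chunk. The BlockingQueue name and the usage shown are made up for the example:

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// Minimal blocking queue for passing buffer indices between pipeline stages.
// A real implementation would add bounded capacity and a shutdown mechanism.
template <typename T>
class BlockingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            q_.push(std::move(value));
        }
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [this]{ return !q_.empty(); });
        T value = std::move(q_.front());
        q_.pop();
        return value;
    }
private:
    std::queue<T> q_;
    std::mutex mtx_;
    std::condition_variable cv_;
};

int main() {
    // In the pipeline, the reader would push filled-buffer indices into a
    // "toSort" queue, the sorter would pop, sort, and push into "toWrite",
    // and the writer would pop and write. Here, just one producer/consumer pair:
    BlockingQueue<int> toWrite;
    std::thread producer([&]{ for (int i = 0; i < 3; i++) toWrite.push(i); });
    std::thread consumer([&]{ for (int i = 0; i < 3; i++) std::cout << toWrite.pop() << '\n'; });
    producer.join();
    consumer.join();
}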

Since you are measuring time using clock on a Linux machine, I expect the total CPU time to be (roughly) the same whether you run one thread or multiple threads.
Maybe you want to use time myprog instead? Or use gettimeofday to fetch the time (which gives you a time in seconds + microseconds, although the microseconds may not be accurate down to the last digit).
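For a portable wall-clock measurement inside the program itself, std::chrono works too; a minimal sketch (where you place the timed call is up to you):

#include <chrono>
#include <iostream>

int main() {
    auto t0 = std::chrono::steady_clock::now();
    // ... run threadtransfer() or lineartransfer() here ...
    auto t1 = std::chrono::steady_clock::now();
    // steady_clock measures elapsed wall time, not CPU time summed over threads.
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::cout << "Total run time: " << ms << "ms\n";
}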
Edit:
Next, don't use endl when writing to a file. It slows things down a lot, because the C++ runtime flushes to the file each time, which is an operating system call. That flushing is almost certainly protected against concurrent use in some way, so you have threads writing their data a single line at a time, synchronously. Most likely this takes nearly 3x as long as running a single thread. Also, don't write to the same file from three different threads - that's going to be bad in one way or another.
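For example, in writeData the per-line flush can be avoided like this ('\n' just appends a newline to the buffer):

outfile << data[(run-2)%3][i] << '\n';  // buffered: no flush on every line
// instead of
outfile << data[(run-2)%3][i] << endl;  // flushes the stream on every line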

Please correct me if I am wrong, but it seems your threaded function is basically a linear function doing three times the work of your linear function?
In a threaded program you would create three threads and run the readData/quickSort functions once on each thread (distributing the workload), but in your program it seems the thread simulation is actually just reading three times, quicksorting three times, and writing three times, and totalling the time it takes to do all three of each.

Related

Allocate more RAM to executable

I've made a program which processes a lot of data, and it takes forever at runtime, but looking in Task Manager I found out that the executable only uses a small part of my CPU and my RAM.
How can I tell my IDE to allocate more resources (as much as it can) to my program?
Running it in Release x64 helps, but not enough.
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

int main() {
    using namespace std;
    struct library {
        int num = 0;
        unsigned int total = 0;
        int booksnum = 0;
        int signup = 0;
        int ship = 0;
        vector<int> scores;
    };
    unsigned int libraries = 30000; // in the program this number is read from a file
    unsigned int books = 20000;     // in the program this number is read from a file
    unsigned int days = 40000;      // in the program this number is read from a file
    vector<int> scores(books, 0);
    vector<library*> all(libraries);
    for(auto& it : all) {
        it = new library;
        it->booksnum = 15000; // in the program this number is read from a file
        it->signup = 50000;   // in the program this number is read from a file
        it->ship = 99999;     // in the program this number is read from a file
        it->scores.resize(it->booksnum, 0);
    }
    unsigned int past = 0;
    for(size_t done = 0; done < all.size(); done++) {
        if(!(done % 1000)) cout << done << '-' << all.size() << endl;
        for(size_t m = done; m < all.size() - 1; m++) {
            all[m]->total = 0;
            {
                double run = past + all[m]->signup;
                for(auto at : all[m]->scores) {
                    if(days - run > 0) {
                        all[m]->total += scores[at];
                        run += 1. / all[m]->ship;
                    } else
                        break;
                }
            }
        }
        for(size_t n = done; n < all.size(); n++)
            for(size_t m = 0; m < all.size() - 1; m++) {
                if(all[m]->total < all[m + 1]->total) swap(all[m], all[m + 1]);
            }
        past += all[done]->signup;
        if (past > days) break;
    }
    return 0;
}
This is the cycle which takes up so much time... For some reason, even using pointers to library doesn't optimize it.
RAM doesn't make things go faster. RAM is just there to store data your program uses; if it's not using much, then it doesn't need much.
Similarly, in terms of CPU usage, the program will use everything it can (the operating system can change priority, and there are APIs for that, but that is probably not your issue).
If you're seeing it use only a fraction of the CPU, the chances are you're either waiting on I/O or you've written a single-threaded application that can only use a single core at any one time. If you've optimised your solution as much as possible on a single thread, then it's worth looking into breaking its work down across multiple threads.
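As a sketch of that last suggestion (the general pattern, not your actual program), independent slices of work can be spread across hardware threads like this:

#include <algorithm>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1000000, 1);
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<long long> partial(n, 0);
    std::vector<std::thread> workers;
    // Each thread sums its own disjoint slice; no locking is needed because
    // no two threads touch the same elements or the same partial total.
    for (unsigned t = 0; t < n; t++) {
        workers.emplace_back([&, t] {
            std::size_t begin = data.size() * t / n;
            std::size_t end = data.size() * (t + 1) / n;
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& w : workers) w.join();
    std::cout << std::accumulate(partial.begin(), partial.end(), 0LL) << '\n';
}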
What you need to do is use a tool called a profiler to find out where your code is spending its time, and then use that information to optimise it. This will help especially with micro-optimisations, but for larger algorithmic changes (i.e. changing how it works entirely) you'll need to think about things at a higher level of abstraction.

c++ thread creation big overhead

I have the following code, which confuses me a lot:
float OverlapRate(cv::Mat& model, cv::Mat& img) {
    if ((model.rows != img.rows) || (model.cols != img.cols)) {
        return 0;
    }
    cv::Mat bgr[3];
    cv::split(img, bgr);
    int counter = 0;
    float b_average = 0, g_average = 0, r_average = 0;
    for (int i = 0; i < model.rows; i++) {
        for (int j = 0; j < model.cols; j++) {
            if (model.at<uchar>(i,j) == 255) {
                counter++;
                b_average += bgr[0].at<uchar>(i, j);
                g_average += bgr[1].at<uchar>(i, j);
                r_average += bgr[2].at<uchar>(i, j);
            }
        }
    }
    b_average = b_average / counter;
    g_average = g_average / counter;
    r_average = r_average / counter;
    counter = 0;
    float b_stde = 0, g_stde = 0, r_stde = 0;
    for (int i = 0; i < model.rows; i++) {
        for (int j = 0; j < model.cols; j++) {
            if (model.at<uchar>(i,j) == 255) {
                counter++;
                b_stde += std::pow((bgr[0].at<uchar>(i, j) - b_average), 2);
                g_stde += std::pow((bgr[1].at<uchar>(i, j) - g_average), 2);
                r_stde += std::pow((bgr[2].at<uchar>(i, j) - r_average), 2);
            }
        }
    }
    b_stde = std::sqrt(b_stde / counter);
    g_stde = std::sqrt(g_stde / counter);
    r_stde = std::sqrt(r_stde / counter);
    return (b_stde + g_stde + r_stde) / 3;
}

void work(cv::Mat& model, cv::Mat& img, int index, std::map<int, float>& results){
    results[index] = OverlapRate(model, img);
}

int OCR(cv::Mat& a, std::map<int,cv::Mat>& b, const std::vector<int>& possible_values)
{
    int recog_value = -1;
    clock_t start = clock();
    std::thread threads[10];
    std::map<int, float> results;
    for(int i = 0; i < 10; i++)
    {
        threads[i] = std::thread(work, std::ref(b[i]), std::ref(a), i, std::ref(results));
    }
    for(int i = 0; i < 10; i++)
        threads[i].join();
    float min_score = 1000;
    int min_index = -1;
    for(auto& it : results)
    {
        if (it.second < min_score) {
            min_score = it.second;
            min_index = it.first;
        }
    }
    clock_t end = clock();
    clock_t t = end - start;
    printf ("It took me %d clicks (%f seconds) .\n", t, ((float)t)/CLOCKS_PER_SEC);
    recog_value = min_index;
    return recog_value;
}
What the above code does is simple optical character recognition. I have one optical character as input and compare it with the ten standard character models for 0 - 9 to find the most similar one, then output the recognized value.
When I execute the above code without the ten threads running at the same time, it takes 7ms. BUT when I use ten threads, it jumps to 1 or 2 seconds for a single optical character recognition.
What is the reason? The debug information says that thread creation consumes a lot of time, namely this line:
threads[i] = std::thread(work, std::ref(b[i]), std::ref(a), i, std::ref(results));
Why? Thanks.
Running multiple threads is useful in only two contexts: you have multiple hardware cores (so the threads can run simultaneously), OR each thread is waiting for I/O (so one thread can run while another waits on I/O, like a disk load or network transfer).
Your code is not I/O bound, so I hope you have 10 cores to run it. If you don't have 10 cores, then the threads will compete for scarce resources, and the scarcest resource of all is L1 cache space. If all 10 threads are fighting over 1 or 2 cores and their cache space, the caches will be "thrashing" and you'll see 10-100x slower performance.
Try benchmarking your code 10 different times, with N = 1 to 10 threads, and see how it performs.
(There is one more reason to have multiple threads, which is when the cores support hyper-threading. The OS will "pretend" that 1 core has 2 virtual processors, but you don't get 2x performance from this; you get something between 1x and 2x. To get this partial boost, you have to run 2 threads per core.)
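A minimal sketch of such a benchmark, assuming the work can be split into independent items (doWork here is a hypothetical stand-in for one model comparison):

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

// Hypothetical stand-in for one unit of work (e.g. one model comparison).
void doWork(int item) { volatile int sink = item * item; (void)sink; }

int main() {
    const int items = 10;
    for (int n = 1; n <= items; n++) {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (int t = 0; t < n; t++) {
            pool.emplace_back([=] {
                // Each thread handles every n-th item.
                for (int i = t; i < items; i += n) doWork(i);
            });
        }
        for (auto& th : pool) th.join();
        auto ms = std::chrono::duration<double, std::milli>(
                      std::chrono::steady_clock::now() - t0).count();
        std::cout << n << " threads: " << ms << "ms\n";
    }
}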
It is not always efficient to use threads. If you use threads on a small problem, then managing the threads costs more time and resources than solving the problem. You must have enough work for the threads and good management of the work across them.
If you want to know how many threads you can use on a problem, or how big the problem must be, look up the isoefficiency functions (psi1, psi2, psi3) from the theory of parallel computing.

threading program in C++ not faster

I have a program which reads a file line by line and then stores each possible substring of length 50 in a hash table along with its frequency. I tried to use threads in my program so that it reads 5 lines and then uses five different threads to do the processing. The processing involves reading each substring of a line and putting it into the hash map with its frequency. But there seems to be something wrong which I cannot figure out, because the program is not faster than the serial approach. Also, for large input files it aborts. Here is the piece of code I am using:
unordered_map<string, int> m;
mutex mtx;

void parseLine(char *line, int subLen){
    int no_substr = strlen(line) - subLen;
    for(int i = 0; i <= no_substr; i++) {
        char *subStr = (char*) malloc(sizeof(char) * subLen + 1);
        strncpy(subStr, line + i, subLen);
        subStr[subLen] = '\0';
        mtx.lock();
        string s(subStr);
        if(m.find(s) != m.end()) m[s]++;
        else {
            pair<string, int> ret(s, 1);
            m.insert(ret);
        }
        mtx.unlock();
        free(subStr);
    }
}

int main(){
    char **Array = (char **) malloc(sizeof(char *) * num_th + 1);
    int num = 0;
    while (NOT END OF FILE) {
        if(num < num_th) {
            if(num == 0)
                for(int x = 0; x < num_th; x++)
                    Array[x] = (char*) malloc(sizeof(char) * strlen(line) + 1);
            strcpy(Array[num], line);
            num++;
        }
        else {
            vector<thread> threads;
            for(int i = 0; i < num_th; i++) {
                threads.push_back(thread(parseLine, Array[i], subLen));
            }
            for(int i = 0; i < num_th; i++){
                if(threads[i].joinable()) {
                    threads[i].join();
                }
            }
            for(int x = 0; x < num_th; x++) free(Array[x]);
            num = 0;
        }
    }
}
It's a myth that just by virtue of using threads, the end result must be faster. In general, in order to take advantage of multithreading, two conditions must be met(*):
1) You actually have to have sufficient physical CPU cores that can run the threads at the same time.
2) The threads have independent tasks to do, which they can do on their own.
From a cursory examination of the shown code, it seems to fail on the second part. It seems to me that most of the time all of these threads will be fighting each other to acquire the same mutex. There's little to be gained from multithreading in this situation.
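To make point 2) concrete, one common restructuring (a sketch, not a drop-in fix for the shown code) is to let each thread count into its own private map and merge under the lock once per thread instead of once per substring:

#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

std::unordered_map<std::string, int> m;
std::mutex mtx;

void parseLineLocal(const std::string& line, std::size_t subLen) {
    // Count substrings into a thread-private map; no locking here.
    std::unordered_map<std::string, int> local;
    for (std::size_t i = 0; i + subLen <= line.size(); i++)
        local[line.substr(i, subLen)]++;
    // Merge into the shared map once, under a single lock.
    std::lock_guard<std::mutex> lock(mtx);
    for (const auto& kv : local) m[kv.first] += kv.second;
}

int main() {
    std::vector<std::thread> ts;
    ts.emplace_back(parseLineLocal, "abcabcabc", 3);
    ts.emplace_back(parseLineLocal, "xyzxyz", 3);
    for (auto& t : ts) t.join();
}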
(*) Of course, you don't always use threads for purely performance reasons. Multithreading also comes in useful in many other situations, for example in a program with a GUI: having a separate thread update the GUI keeps the UI responsive even while the main execution thread is chewing on something for a while...

What is the overhead in splitting a for-loop into multiple for-loops, if the total work inside is the same? [duplicate]

This question already has answers here:
Why are elementwise additions much faster in separate loops than in a combined loop?
(10 answers)
Performance of breaking apart one loop into two loops
(6 answers)
Closed 9 years ago.
What is the overhead in splitting a for-loop like this,
int i;
for (i = 0; i < exchanges; i++)
{
    // some code
    // some more code
    // even more code
}
into multiple for-loops like this?
int i;
for (i = 0; i < exchanges; i++)
{
    // some code
}
for (i = 0; i < exchanges; i++)
{
    // some more code
}
for (i = 0; i < exchanges; i++)
{
    // even more code
}
The code is performance-sensitive, but doing the latter would improve readability significantly. (In case it matters, there are no other loops, variable declarations, or function calls, save for a few accessors, within each loop.)
I'm not exactly a low-level programming guru, so it'd be even better if someone could quantify the performance hit in comparison to basic operations, e.g. "Each additional for-loop would cost the equivalent of two int allocations." But I understand (and wouldn't be surprised) if it's not that simple.
Many thanks in advance.
There are often way too many factors at play... And it's easy to demonstrate both ways:
For example, splitting the following loop results in almost a 2x slow-down (full test code at the bottom):
for (int c = 0; c < size; c++){
    data[c] *= 10;
    data[c] += 7;
    data[c] &= 15;
}
And this is almost stating the obvious, since you need to loop three times instead of once and you make three passes over the entire array instead of one.
On the other hand, if you take a look at this question: Why are elementwise additions much faster in separate loops than in a combined loop?
for(int j = 0; j < n; j++){
    a1[j] += b1[j];
    c1[j] += d1[j];
}
The opposite is sometimes true due to memory alignment.
What to take from this?
Pretty much anything can happen. Neither way is always faster and it depends heavily on what's inside the loops.
And as such, determining whether such an optimization will increase performance is usually trial-and-error. With enough experience you can make fairly confident (educated) guesses. But in general, expect anything.
"Each additional for-loop would cost the equivalent of two int allocations."
You are correct that it's not that simple. In fact it's so complicated that the numbers don't mean much. A loop iteration may take X cycles in one context, but Y cycles in another due to a multitude of factors such as Out-of-order Execution and data dependencies.
Not only is the performance context-dependent, it also varies across different processors.
Here's the test code:
#include <time.h>
#include <cstdlib>
#include <iostream>
using namespace std;

int main(){
    int size = 10000;
    int *data = new int[size];
    clock_t start = clock();
    for (int i = 0; i < 1000000; i++){
#ifdef TOGETHER
        for (int c = 0; c < size; c++){
            data[c] *= 10;
            data[c] += 7;
            data[c] &= 15;
        }
#else
        for (int c = 0; c < size; c++){
            data[c] *= 10;
        }
        for (int c = 0; c < size; c++){
            data[c] += 7;
        }
        for (int c = 0; c < size; c++){
            data[c] &= 15;
        }
#endif
    }
    clock_t end = clock();
    cout << (double)(end - start) / CLOCKS_PER_SEC << endl;
    system("pause");
}
Output (one loop): 4.08 seconds
Output (3 loops): 7.17 seconds
Processors prefer to have a higher ratio of data instructions to jump instructions.
Branch instructions may force your processor to clear the instruction pipeline and reload it.
Because of that pipeline reloading, the first method would be faster, though not significantly: you add at least 2 new branch instructions by splitting.
A faster optimization is to unroll the loop. Unrolling tries to improve the ratio of data instructions to branch instructions by performing more work inside the loop before branching back to the top.
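For example, unrolling the loop over data from the test code above by a factor of 4 (assuming size is a multiple of 4; otherwise you need a cleanup loop for the remainder):

// Four elements processed per iteration: same work, a quarter of the branches.
for (int c = 0; c < size; c += 4){
    data[c]     *= 10;
    data[c + 1] *= 10;
    data[c + 2] *= 10;
    data[c + 3] *= 10;
}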
Another significant performance optimization is to organize the data so it fits into one of the processor's cache lines. For example, you could have inner loops that each process one cache-sized block of data, while the outer loop loads new blocks into the cache.
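A sketch of that structure on the same array (BLOCK is a made-up tile size standing in for whatever fits your cache):

const int BLOCK = 4096;  // hypothetical tile size, chosen to fit in cache
for (int base = 0; base < size; base += BLOCK){
    int end = base + BLOCK < size ? base + BLOCK : size;
    // All three passes run over one cache-resident block before moving on.
    for (int c = base; c < end; c++) data[c] *= 10;
    for (int c = base; c < end; c++) data[c] += 7;
    for (int c = base; c < end; c++) data[c] &= 15;
}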
These optimizations should only be applied after the program runs correctly and robustly and the environment demands more performance. The environment is defined by observers (animation / movies), users (waiting for a response), or hardware (performing operations before a critical time event). Any other purpose is a waste of your time, as the OS (running concurrent programs) and storage access will contribute more to your program's performance issues.
This will give you a good indication of whether or not one version is faster than another.
#include <array>
#include <chrono>
#include <iostream>
#include <numeric>
#include <string>

const int iterations = 100;

namespace
{
    const int exchanges = 200;

    template<typename TTest>
    void Test(const std::string &name, TTest &&test)
    {
        typedef std::chrono::high_resolution_clock Clock;
        typedef std::chrono::duration<float, std::milli> ms;
        std::array<float, iterations> timings;
        for (auto i = 0; i != iterations; ++i)
        {
            auto t0 = Clock::now();
            test();
            timings[i] = ms(Clock::now() - t0).count();
        }
        auto avg = std::accumulate(timings.begin(), timings.end(), 0.0f) / iterations;
        std::cout << "Average time, " << name << ": " << avg << std::endl;
    }
}

int main()
{
    Test("single loop",
        []()
        {
            for (auto i = 0; i < exchanges; ++i)
            {
                // some code
                // some more code
                // even more code
            }
        });
    Test("separated loops",
        []()
        {
            for (auto i = 0; i < exchanges; ++i)
            {
                // some code
            }
            for (auto i = 0; i < exchanges; ++i)
            {
                // some more code
            }
            for (auto i = 0; i < exchanges; ++i)
            {
                // even more code
            }
        });
}
The thing is quite simple. The first code is like taking a single lap on a race track, and the second is like running a full 3-lap race, so three laps take more time than one. However, if the loops do something that needs to happen in sequence and they depend on each other, then the second version is the one you need. For example, if the first loop does some calculations and the second loop does some work with those calculations, then both loops need to run in sequence; otherwise not.

Multithreading taking equal time as single thread quick sorting

I'm working on Linux, but multithreading and single threading both take 340ms. Can someone tell me what's wrong with what I'm doing?
Here is my code
#include<time.h>
#include<fstream>
#include<iostream>
#include<pthread.h>

#define SIZE_OF_ARRAY 1000000

using namespace std;

struct parameter
{
    int *data;
    int left;
    int right;
};

void readData(int *data)
{
    fstream iFile("Data.txt");
    for(int i = 0; i < SIZE_OF_ARRAY; i++)
        iFile>>data[i];
}

int threadCount = 4;

int Partition(int *data, int left, int right)
{
    int i = left, j = right, temp;
    int pivot = data[(left + right) / 2];
    while(i <= j)
    {
        while(data[i] < pivot)
            i++;
        while(data[j] > pivot)
            j--;
        if(i <= j)
        {
            temp = data[i];
            data[i] = data[j];
            data[j] = temp;
            i++;
            j--;
        }
    }
    return i;
}

void QuickSort(int *data, int left, int right)
{
    int index = Partition(data, left, right);
    if(left < index - 1)
        QuickSort(data, left, index - 1);
    if(index < right)
        QuickSort(data, index + 1, right);
}

//Multithreading code starts from here
void *Sort(void *param)
{
    parameter *param1 = (parameter *)param;
    QuickSort(param1->data, param1->left, param1->right);
    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    clock_t start, diff;
    int *data = new int[SIZE_OF_ARRAY];
    pthread_t threadID, threadID1;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    parameter param, param1;
    readData(data);
    start = clock();
    int index = Partition(data, 0, SIZE_OF_ARRAY - 1);
    if(0 < index - 1)
    {
        param.data = data;
        param.left = 0;
        param.right = index - 1;
        pthread_create(&threadID, NULL, Sort, (void *)&param);
    }
    if(index < SIZE_OF_ARRAY - 1)
    {
        param1.data = data;
        param1.left = index + 1;
        param1.right = SIZE_OF_ARRAY - 1;
        pthread_create(&threadID1, NULL, Sort, (void *)&param1);
    }
    pthread_attr_destroy(&attr);
    pthread_join(threadID, NULL);
    pthread_join(threadID1, NULL);
    diff = clock() - start;
    cout<<"Sorting Time = "<<diff * 1000 / CLOCKS_PER_SEC<<"\n";
    delete []data;
    return 0;
}
//Multithreading ends here
Single thread main function
int main(int argc, char *argv[])
{
    clock_t start, diff;
    int *data = new int[SIZE_OF_ARRAY];
    readData(data);
    start = clock();
    QuickSort(data, 0, SIZE_OF_ARRAY - 1);
    diff = clock() - start;
    cout<<"Sorting Time = "<<diff * 1000 / CLOCKS_PER_SEC<<"\n";
    delete []data;
    return 0;
}
//Single-thread code ends here
Some of the functions are shared between the single-threaded and multithreaded versions.
clock returns total CPU time, not wall time.
If you have 2 CPUs and 2 threads, then after one second of running both threads simultaneously, clock will return a CPU time of 2 seconds (the sum of the CPU times of the threads).
So the result is totally expected. It does not matter how many CPUs you have; the total running time summed over all CPUs will be (roughly) the same.
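A small sketch that makes the difference visible: two busy threads for about one second of wall time, where clock() reports roughly the sum of both threads' CPU time:

#include <chrono>
#include <ctime>
#include <iostream>
#include <thread>

// Spin for roughly one second of wall time.
void spin() {
    auto t0 = std::chrono::steady_clock::now();
    while (std::chrono::steady_clock::now() - t0 < std::chrono::seconds(1)) {}
}

int main() {
    clock_t c0 = clock();
    auto w0 = std::chrono::steady_clock::now();
    std::thread a(spin), b(spin);
    a.join();
    b.join();
    std::cout << "clock():    " << double(clock() - c0) / CLOCKS_PER_SEC
              << "s (CPU time, summed over threads)\n";
    std::cout << "wall clock: " << std::chrono::duration<double>(
                     std::chrono::steady_clock::now() - w0).count() << "s\n";
}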
Note that you call Partition once from the main thread...
The code works on the same memory block, which prevents one CPU from working while the other accesses that same memory block. Unless your data is really large, you're likely to have many such hits.
Finally, if your algorithm works at memory speed when you run it with one thread, adding more threads doesn't help. I did such tests a while back with image data, and having multiple threads decreased the total speed because the process was so memory intensive that both threads were fighting to access memory... and the result was worse than not having threads at all.
What makes really fast computers today really fast is running one very intensive process per computer, not a large number of threads (or processes) on a single computer.
Build a thread pool with a producer-consumer queue and 24 threads hanging off it. Partition your data into two and issue a mergesort task object to the pool; the mergesort object should issue further pairs of mergesorts to the queue and wait on a signal for them to finish, and so on, until a mergesort object finds that it has [L1 cache-size data]. The object then quicksorts its data and signals completion to its parent task.
If that doesn't turn out to be blindingly quick on 24 cores, I'll stop posting about threads...
...and it will handle multiple sorts in parallel.
...and the pool can be used for other tasks.
...and there is no performance-destroying, deadlock-generating join() or synchronize() (if you except the P-C queue, which only locks for long enough to push an object ref on), no thread-creation overhead, and no dodgy thread-stopping/terminating/destroying code. Like the barbers, there is no waiting - as soon as a thread is finished with a task it can get another.
No thread micro-management, no tuning (you could create 64 threads now, ready for the next generation of boxes). You could make the thread count tuneable - just add more threads at runtime, or delete some by queueing up poison-pills.
You don't need a reference to the threads at all - just set 'em off (pass the queue as a parameter).
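For illustration, a minimal sketch of such a pool (simplified: a plain task queue with poison-pill shutdown; the recursive mergesort-task layering described above is left out):

#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

int main() {
    std::queue<std::function<void()>> tasks;
    std::mutex mtx;
    std::condition_variable cv;

    auto worker = [&] {
        for (;;) {
            std::function<void()> task;
            {
                // The queue only locks for long enough to pop a task.
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [&]{ return !tasks.empty(); });
                task = std::move(tasks.front());
                tasks.pop();
            }
            if (!task) return;  // an empty function acts as the poison pill
            task();
        }
    };

    const int nthreads = 4;
    std::vector<std::thread> pool;
    for (int i = 0; i < nthreads; i++) pool.emplace_back(worker);

    // Producer side: push work, then one poison pill per thread.
    {
        std::lock_guard<std::mutex> lock(mtx);
        for (int i = 0; i < 8; i++)
            tasks.push([i]{ std::cout << "task " << i << " done\n"; });
        for (int i = 0; i < nthreads; i++)
            tasks.push(std::function<void()>{});  // poison pills
    }
    cv.notify_all();
    for (auto& t : pool) t.join();
}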