Using Boost.Lockfree queue is slower than using mutexes - c++

Until now I was using std::queue in my project. I measured the average time which a specific operation on this queue requires.
The times were measured on 2 machines: My local Ubuntu VM and a remote server.
Using std::queue, the average was almost the same on both machines: ~750 microseconds.
Then I "upgraded" the std::queue to boost::lockfree::spsc_queue, so I could get rid of the mutexes protecting the queue. On my local VM I could see a huge performance gain, the average is now on 200 microseconds. On the remote machine however, the average went up to 800 microseconds, which is slower than it was before.
First I thought this might be because the remote machine might not support the lock-free implementation:
From the Boost.Lockfree page:
Not all hardware supports the same set of atomic instructions. If it is not available in hardware, it can be emulated in software using guards. However this has the obvious drawback of losing the lock-free property.
To find out if these instructions are supported, boost::lockfree::queue has a method called bool is_lock_free(void) const;.
However, boost::lockfree::spsc_queue does not have a function like this, which, to me, implies that it does not rely on the hardware and that it is always lock-free - on any machine.
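As a sanity check, you can still ask at runtime whether the relevant atomics are lock-free on a given machine. A minimal sketch, assuming Boost.Lockfree and a C++11 compiler (the element types are just examples):
#include <atomic>
#include <cstddef>
#include <iostream>
#include <boost/lockfree/queue.hpp>

int main()
{
    // boost::lockfree::queue exposes is_lock_free() directly.
    boost::lockfree::queue<int> q(128);
    std::cout << "queue is lock-free: " << q.is_lock_free() << std::endl;

    // spsc_queue has no such query; its synchronization boils down to atomic
    // read/write indices, and std::atomic lets you ask the same question for those.
    std::atomic<std::size_t> index(0);
    std::cout << "atomic<size_t> is lock-free: " << index.is_lock_free() << std::endl;
    return 0;
}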
What could be the reason for the performance loss?
Example code (Producer/Consumer)
// C++11 compiler and Boost library required
#include <iostream>
#include <cstdlib>
#include <chrono>
#include <future>
#include <thread>
/* Using blocking queue:
 * #include <mutex>
 * #include <queue>
 */
#include <boost/lockfree/spsc_queue.hpp>

boost::lockfree::spsc_queue<int, boost::lockfree::capacity<1024>> queue;
/* Using blocking queue:
 * std::queue<int> queue;
 * std::mutex mutex;
 */

int main()
{
    // queue (and mutex) are globals, so the lambdas need no captures.
    auto producer = std::async(std::launch::async, []()
    {
        // Producing data in a random interval
        while (true)
        {
            /* Using the blocking queue, the mutex must be locked here.
             * mutex.lock();
             */

            // Push random int (0-9999)
            queue.push(std::rand() % 10000);

            /* Using the blocking queue, the mutex must be unlocked here.
             * mutex.unlock();
             */

            // Sleep for a random duration (0-999 microseconds)
            std::this_thread::sleep_for(std::chrono::microseconds(std::rand() % 1000));
        }
    });

    auto consumer = std::async(std::launch::async, []()
    {
        // Example operation on the queue.
        // Checks if 1234 was generated by the producer, returns if found.
        while (true)
        {
            /* Using the blocking queue, the mutex must be locked here.
             * mutex.lock();
             */

            int value;
            while (queue.pop(value))
            {
                if (value == 1234)
                    return;
            }

            /* Using the blocking queue, the mutex must be unlocked here.
             * mutex.unlock();
             */

            // Sleep for 100 microseconds
            std::this_thread::sleep_for(std::chrono::microseconds(100));
        }
    });

    // Wait for the consumer to find 1234 (the producer keeps running).
    consumer.get();
    std::cout << "1234 was generated!" << std::endl;
    return 0;
}

Lock free algorithms generally perform more poorly than lock-based algorithms. That's a key reason they're not used nearly as frequently.
The problem with lock free algorithms is that they maximize contention by allowing contending threads to continue to contend. Locks avoid contention by de-scheduling contending threads. Lock free algorithms, to a first approximation, should only be used when it's not possible to de-schedule contending threads. That only rarely applies to application-level code.
Let me give you a very extreme hypothetical. Imagine four threads are running on a typical, modern dual-core CPU. Threads A1 and A2 are manipulating collection A. Threads B1 and B2 are manipulating collection B.
First, let's imagine the collection uses locks. That will mean that if threads A1 and A2 (or B1 and B2) try to run at the same time, one of them will get blocked by the lock. So, very quickly, one A thread and one B thread will be running. These threads will run very quickly and will not contend. Any time threads try to contend, the conflicting thread will get de-scheduled. Yay.
Now, imagine the collection uses no locks. Now, threads A1 and A2 can run at the same time. This will cause constant contention. Cache lines for the collection will ping-pong between the two cores. Inter-core buses may get saturated. Performance will be awful.
Again, this is highly exaggerated. But you get the idea. You want to avoid contention, not suffer through as much of it as possible.
However, now run this thought experiment again where A1 and A2 are the only threads on the entire system. Now, the lock free collection is probably better (though you may find that it's better just to have one thread in that case!).
Almost every programmer goes through a phase where they think that locks are bad and avoiding locks makes code go faster. Eventually, they realize that it's contention that makes things slow and locks, used correctly, minimize contention.
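As a rough illustration of the contention point, here is a small sketch that times two threads hammering one shared counter, once behind a mutex and once with a CAS retry loop. The iteration count is arbitrary, and the absolute numbers are not the point: the lock-free version keeps both threads actively fighting over the same cache line, so measure on your own hardware before assuming either approach wins.
#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

static const int kIterations = 1000000;

template <typename F>
long long time_two_threads(F work)
{
    auto start = std::chrono::steady_clock::now();
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

int main()
{
    long long locked_counter = 0;
    std::mutex m;
    auto locked = [&]() {
        for (int i = 0; i < kIterations; ++i) {
            std::lock_guard<std::mutex> lk(m);  // contenders get blocked / de-scheduled
            ++locked_counter;
        }
    };

    std::atomic<long long> lockfree_counter(0);
    auto lockfree = [&]() {
        for (int i = 0; i < kIterations; ++i) {
            long long old = lockfree_counter.load();
            // CAS retry loop: contending threads keep retrying instead of sleeping.
            while (!lockfree_counter.compare_exchange_weak(old, old + 1)) { }
        }
    };

    std::cout << "mutex:     " << time_two_threads(locked)   << " ms\n";
    std::cout << "lock-free: " << time_two_threads(lockfree) << " ms\n";
    return 0;
}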

I cannot say that the Boost lock-free queue is slower in all possible cases. In my experience, push(const T& item) makes a copy. If you are constructing temporary objects and pushing them onto the queue, you take a performance hit. I think the library just needs an overloaded push(T&& item) to make pushing movable objects more efficient. Until such a function is added, you may have to push pointers, plain types, or the smart pointers offered since C++11. This is a rather limited aspect of the queue, and I use the lock-free queue only very rarely.
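One common workaround, sketched below with a hypothetical Message type, is to push raw pointers to heap-allocated objects so that only the pointer is copied through the queue and the consumer takes ownership back:
#include <boost/lockfree/queue.hpp>
#include <memory>
#include <string>

struct Message {
    std::string payload;   // expensive to copy, cheap to hand over by pointer
};

// boost::lockfree::queue requires a trivially copyable element type,
// so we push raw pointers and transfer ownership through the queue.
boost::lockfree::queue<Message*> message_queue(1024);

void producer_side()
{
    std::unique_ptr<Message> msg(new Message{std::string(4096, 'x')});
    if (message_queue.push(msg.get()))
        msg.release();                     // the queue now owns the object
}

void consumer_side()
{
    Message* raw = nullptr;
    while (message_queue.pop(raw)) {
        std::unique_ptr<Message> msg(raw); // take ownership back, freed at scope exit
        // ... use msg->payload ...
    }
}

int main()
{
    producer_side();
    consumer_side();
    return 0;
}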

Related

The definition of lock-free

There are three different types of "lock-free" algorithms. The definitions given in Concurrency in Action are:
Obstruction-Free: If all other threads are paused, then any given
thread will complete its operation in a bounded number of steps.
Lock-Free: If multiple threads are operating on a data structure, then
after a bounded number of steps one of them will complete its
operation.
Wait-Free: Every thread operating on a data structure
will complete its operation in a bounded number of steps, even if
other threads are also operating on the data structure.
Herb Sutter says in his talk Lock-Free Programming:
Informally, "lock-free" ≈ "doesn't use mutexes" == any of these.
I do not see why lock-based algorithms can't fall into the lock-free definition given above. Here is a simple lock-based program:
#include <iostream>
#include <mutex>
#include <thread>
std::mutex x_mut;
void print(int& x) {
std::lock_guard<std::mutex> lk(x_mut);
std::cout << x;
}
void add(int& x, int y) {
std::lock_guard<std::mutex> lk(x_mut);
x += y;
}
int main() {
int i = 3;
std::thread thread1{print, std::ref(i)};
std::thread thread2(add, std::ref(i), 4);
thread1.join();
thread2.join();
}
If both of these threads are operating, then after a bounded number of steps, one of them must complete. Why does my program not satisfy the definition of "Lock-free"?
I would be careful about saying "bounded" without defining by what.
The canonical lock-free primitive, the CAS loop, does not give any bound: under heavy contention a thread can be repeatedly unlucky and wait forever, and that is allowed. The defining property of lock-free algorithms is that there is always system-wide progress. At any point in time, some thread will make progress.
The stronger guarantee, that every thread makes some progress at every point in time, is called wait-free.
In other words, lock-free guarantees that a misbehaving thread cannot fatally impact all other threads; wait-free guarantees that it cannot fatally impact any other thread.
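For concreteness, here is the canonical shape of such a CAS loop, as a sketch using std::atomic. The increment is lock-free: some thread's compare-exchange always succeeds, so the system as a whole makes progress, but any individual thread can keep losing the race, so it is not wait-free.
#include <atomic>

std::atomic<int> counter(0);

void lock_free_increment()
{
    int expected = counter.load();
    // If another thread wins the race, compare_exchange_weak fails,
    // reloads 'expected' with the current value, and we retry.
    // System-wide progress is guaranteed (someone's CAS succeeds),
    // but this particular thread has no bound on its retries.
    while (!counter.compare_exchange_weak(expected, expected + 1)) {
        // retry
    }
}

// By contrast, a single fetch_add is wait-free on hardware where it maps
// to one atomic instruction: every thread finishes in a bounded number
// of steps regardless of what the other threads are doing.
void wait_free_increment()
{
    counter.fetch_add(1);
}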
If both of these threads are operating, then after a bounded number of steps, one of them must complete. Why does my program not satisfy the definition of "Lock-free"?
Because an (unfair) scheduler must be taken into account.* If a thread holding the lock is put to sleep, no other thread will be able to make any progress -> not lock-free, and there is certainly no bound. That cannot happen with lock-free programming: the resources are always available, and unfortunate scheduling of one thread can only make other threads' operations complete faster, not slower.
* In particular, where the suspension time for any thread is not limited in frequency or duration. If it was, any algorithm would be wait-free (with some huge constant) by definition.
Your quote from Concurrency in Action is taken out of context.
In fact, what the book actually says is:
7.1 Definitions and consequences
Algorithms and data structures that use mutexes, condition variables,
and futures to synchronize the data are called blocking data
structures and algorithms.
Data structures and algorithms that don’t use blocking library
functions are said to be nonblocking. Not all these data structures
are lock-free ...
Then it proceeds to further break down nonblocking algorithms into Obstruction-Free, Lock-Free and Wait-Free.
So a Lock-Free algorithm is
a nonblocking algorithm (it does not use locks like a mutex) and
it is able to make progress in at least one thread in a bounded number of steps.
So both Herb Sutter and Anthony Williams are correct.

boost lockfree queue performance issue calling boost::lockfree::queue push(), pop(), empty() functions [duplicate]


boost::lockfree::spsc_queue busy wait strategy. Is there a blocking pop?

So I'm using a boost::lockfree::spsc_queue to communicate between two boost::thread instances running functors of two objects in my application.
All is fine except for the fact that the spsc_queue::pop() method is non-blocking. It returns true or false even if there is nothing in the queue. However, my queue always seems to return true (problem #1). I think this is because I preallocate the queue.
typedef boost::lockfree::spsc_queue<q_pl, boost::lockfree::capacity<100000> > spsc_queue;
This means that to use the queue efficiently I have to busy-wait, constantly popping the queue and using 100% CPU. I'd rather not sleep for arbitrary amounts of time. I've used other queues in Java which block until an object is made available. Can this be done with std:: or boost:: data structures?
A lock free queue, by definition, does not have blocking operations.
How would you synchronize on the data structure? There is no internal lock, for obvious reasons, because that would mean all clients need to synchronize on it, making it your grandfather's locking concurrent queue.
So indeed, you will have to devise a waiting function yourself. How you do this depends on your use case, which is probably why the library doesn't supply one (disclaimer: I haven't checked and I don't claim to know the full documentation).
So what can you do:
As you already described, you can spin in a tight loop. Obviously, you'll do this if you know that your wait condition (queue non-empty) is always going to be satisfied very quickly.
Alternatively, poll the queue at a certain frequency (doing micro-sleeps in the meantime). Choosing a good frequency is an art: for some applications 100ms is optimal, for others a potential 100ms wait would destroy throughput. So vary it and measure your performance indicators (and don't forget about power consumption if your application is going to be deployed on many cores in a datacenter :)).
Lastly, you could arrive at a hybrid solution: spin for a fixed number of iterations, and resort to (increasing) interval polling if nothing arrives; a sketch follows below. This would nicely support server applications where high loads occur in bursts.
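A rough sketch of that hybrid strategy, assuming the spsc_queue from the question; the spin count and sleep bounds are placeholders you would tune for your workload:
#include <boost/lockfree/spsc_queue.hpp>
#include <chrono>
#include <thread>

boost::lockfree::spsc_queue<int, boost::lockfree::capacity<1024>> queue;

// Pop one item, spinning briefly first and then backing off with
// increasingly long sleeps if nothing arrives.
int blocking_pop()
{
    int value;

    // Phase 1: spin for a bounded number of iterations.
    for (int i = 0; i < 1000; ++i) {
        if (queue.pop(value))
            return value;
    }

    // Phase 2: poll with exponentially growing sleeps, capped.
    std::chrono::microseconds delay(1);
    const std::chrono::microseconds max_delay(1000);
    while (!queue.pop(value)) {
        std::this_thread::sleep_for(delay);
        if (delay < max_delay)
            delay *= 2;
    }
    return value;
}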
Use a semaphore to cause the producers to sleep when the queue is full, and another semaphore to cause the consumers to sleep when the queue is empty.
When the queue is neither full nor empty, the sem_post and sem_wait operations are non-blocking (in newer kernels).
#include <semaphore.h>
#include <cassert>
#include <cstddef>

template <typename lock_free_container>
class blocking_lock_free
{
public:
    typedef typename lock_free_container::value_type value_type;

    blocking_lock_free(size_t n) : container(n)
    {
        sem_init(&pop_semaphore, 0, 0);
        sem_init(&push_semaphore, 0, n);
    }

    ~blocking_lock_free()
    {
        sem_destroy(&pop_semaphore);
        sem_destroy(&push_semaphore);
    }

    bool push(const value_type& v)
    {
        sem_wait(&push_semaphore);
        bool ret = container.bounded_push(v);
        assert(ret);
        if (ret)
            sem_post(&pop_semaphore);
        else
            sem_post(&push_semaphore); // shouldn't happen
        return ret;
    }

    bool pop(value_type& v)
    {
        sem_wait(&pop_semaphore);
        bool ret = container.pop(v);
        assert(ret);
        if (ret)
            sem_post(&push_semaphore);
        else
            sem_post(&pop_semaphore); // shouldn't happen
        return ret;
    }

private:
    lock_free_container container;
    sem_t pop_semaphore;
    sem_t push_semaphore;
};
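Used roughly like this, as a hypothetical sketch building on the wrapper above (int is just an example element type):
#include <boost/lockfree/queue.hpp>
#include <iostream>
#include <thread>

int main()
{
    // Wrap a bounded lock-free queue so that push blocks when full
    // and pop blocks when empty.
    blocking_lock_free<boost::lockfree::queue<int> > q(1000);

    std::thread producer([&q]() {
        for (int i = 0; i < 10000; ++i)
            q.push(i);                 // sleeps on the semaphore if the queue is full
    });

    std::thread consumer([&q]() {
        int v = 0;
        for (int i = 0; i < 10000; ++i)
            q.pop(v);                  // sleeps on the semaphore if the queue is empty
        std::cout << "last value popped: " << v << std::endl;
    });

    producer.join();
    consumer.join();
    return 0;
}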

C++ Multithreading decoding audio data [closed]

I need to decode audio data as fast as possible using the Opus decoder.
Currently my application is not fast enough.
The decoding is as fast as it can get, but I need to gain more speed.
I need to decode about 100 sections of audio. These sections are not consecutive (they are not related to each other).
I was thinking about using multi-threading so that I don't have to wait until one of the 100 decodings are completed. In my dreams I could start everything in parallel.
I have not used multithreading before.
I would therefore like to ask if my approach is generally fine or if there is a thinking mistake somewhere.
Thank you.
This answer is probably going to need a little refinement from the community, since it's been a long while since I worked in this environment, but here's a start -
Since you're new to multi-threading in C++, start with a simple project to create a bunch of pthreads doing a simple task.
Here's a quick and small example of creating pthreads:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
void* ThreadStart(void* arg);
int main( int count, char** argv) {
pthread_t thread1, thread2;
int* threadArg1 = (int*)malloc(sizeof(int));
int* threadArg2 = (int*)malloc(sizeof(int));
*threadArg1 = 1;
*threadArg2 = 2;
pthread_create(&thread1, NULL, &ThreadStart, (void*)threadArg1 );
pthread_create(&thread2, NULL, &ThreadStart, (void*)threadArg2 );
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
free(threadArg1);
free(threadArg2);
}
void* ThreadStart(void* arg) {
int threadNum = *((int*)arg);
printf("hello world from thread %d\n", threadNum);
return NULL;
}
Next, you're going to be using multiple opus decoders. Opus appears to be thread safe, so long as you create separate OpusDecoder objects for each thread.
To feed jobs to your threads, you'll need a list of pending work units that can be accessed in a thread safe manner. You can use std::vector or std::queue, but you'll have to use locks around it when adding to it and when removing from it, and you'll want to use a counting semaphore so that the threads will block, but stay alive, while you slowly add workunits to the queue (say, buffers of files read from disk).
Here's some example code similar from above that shows how to use a shared queue, and how to make the threads wait while you fill the queue:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <queue>
#include <semaphore.h>
#include <unistd.h>
void* ThreadStart(void* arg);
static std::queue<int> workunits;
static pthread_mutex_t workunitLock;
static sem_t workunitCount;
int main( int count, char** argv) {
pthread_t thread1, thread2;
pthread_mutex_init(&workunitLock, NULL);
sem_init(&workunitCount, 0, 0);
pthread_create(&thread1, NULL, &ThreadStart, NULL);
pthread_create(&thread2, NULL, &ThreadStart, NULL);
// Make a bunch of workunits while the threads are running.
for (int i = 0; i < 200; i++ ){
pthread_mutex_lock(&workunitLock);
workunits.push(i);
sem_post(&workunitCount);
pthread_mutex_unlock(&workunitLock);
// Pretend that it takes some effort to create work units;
// this shows that the threads really do block patiently
// while we generate workunits.
usleep(5000);
}
// Sometime in the next while, the threads will be blocked on
// sem_wait because they're waiting for more workunits. None
// of them are quitting because they never saw an empty queue.
// Pump the semaphore once for each thread so they can wake
// up, see the empty queue, and return.
sem_post(&workunitCount);
sem_post(&workunitCount);
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
pthread_mutex_destroy(&workunitLock);
sem_destroy(&workunitCount);
}
void* ThreadStart(void* arg) {
int workUnit;
bool haveUnit;
do{
sem_wait(&workunitCount);
pthread_mutex_lock(&workunitLock);
// Figure out if there's a unit, grab it under
// the lock, then release the lock as soon as we can.
// After we release the lock, then we can 'process'
// the unit without blocking everybody else.
haveUnit = !workunits.empty();
if ( haveUnit ) {
workUnit = workunits.front();
workunits.pop();
}
pthread_mutex_unlock(&workunitLock);
// Now that we're not under the lock, we can spend
// as much time as we want processing the workunit.
if ( haveUnit ) {
printf("Got workunit %d\n", workUnit);
}
}
while(haveUnit);
return NULL;
}
You would break your work up by task. Let's assume your process is in fact CPU bound (you indicate it is but… it's not usually that simple).
Right now, you decode 100 sections:
I was thinking about using multi-threading so that I don't have to wait until one of the 100 decodings are completed. In my dreams I could start everything in parallel.
Actually, you should use a number close to the number of cores on the machine.
Assuming a modern desktop (e.g. 2-8 cores), running 100 threads at once will just slow things down; the kernel will waste a lot of time switching from one thread to another, and the process is also likely to hit higher peak resource usage and contend for the same resources.
So just create a task pool which restricts the number of active tasks to the number of cores. Each task would (generally) represent the decoding work to perform for one input file (section). This way, the decoding process is not actually sharing data across multiple threads (allowing you to avoid locking and other resource contention).
When complete, go back and fine tune the number of processes in the task pool (e.g. using the exact same inputs and a stopwatch on multiple machines). The fastest may be lower or higher than the number of cores (most likely because of disk I/O). It also helps to profile.
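A minimal sketch of that task-pool idea using std::thread; decode_section below is a hypothetical stand-in for the actual Opus decoding of one section:
#include <atomic>
#include <thread>
#include <vector>

static const int kNumSections = 100;

// Hypothetical stand-in for decoding one independent section with Opus.
void decode_section(int index)
{
    (void)index; // real decoding work goes here
}

void decode_all_sections()
{
    unsigned num_workers = std::thread::hardware_concurrency();
    if (num_workers == 0)
        num_workers = 4;   // fallback if the value is unknown

    std::atomic<int> next_section(0);
    std::vector<std::thread> workers;

    for (unsigned i = 0; i < num_workers; ++i) {
        workers.push_back(std::thread([&next_section]() {
            // Each worker claims the next unprocessed section index
            // until all sections have been handed out.
            for (;;) {
                int section = next_section.fetch_add(1);
                if (section >= kNumSections)
                    break;
                decode_section(section);
            }
        }));
    }

    for (std::thread& t : workers)
        t.join();
}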
I would therefore like to ask if my approach is generally fine or if there is a thinking mistake somewhere.
Yes, if the problem is CPU bound, then that is generally fine. This also assumes your decoder/dependent software is capable of running with multiple threads.
The problem you will realize, if these are files on disk, is that you will probably need to optimize how you read (and write?) the files from many cores. Allowing 8 jobs to run at once can make your problem become disk bound -- and 8 simultaneous readers/writers is a bad way to use hard disks, so you may find that it is not as fast as you expected. Therefore, you may need to optimize I/O for your concurrent decode implementation; using larger buffer sizes helps in this regard, but that comes at a cost in memory.
Instead of making your own threads and managing them, I suggest you use a thread pool and give your decoding tasks to the pool. The pool will assign tasks to as many threads as it and the system can handle. There are different types of thread pools, though, so you can set some parameters, such as forcing it to use a specific number of threads or allowing the pool to keep increasing the number of threads.
One thing to keep in mind is that more threads doesn't mean they execute in parallel. I think the correct term is concurrently, unless you have the guarantee that each thread runs on a different CPU (which would give true parallelism).
Your entire pool can come to a halt if blocked for IO.
Before jumping into multithreading as a solution to speed things up, study the concepts of oversubscription and undersubscription.
If the audio processing involves long blocking I/O calls, then multithreading is worth it.
Although the vagueness of your question doesn't really help... how about:
Create a list of audio files to convert.
While there is a free processor,
launch the decoder application with the next file in the queue.
Repeat until there is nothing else in the list
If, during testing, you discover the processors aren't always 100% busy, launch 2 decodes per processor.
It could be done quite easily with a bit of bash/tcl/python.
You can use threads in general, but locking has some issues. I will base the answer around POSIX threads and locks, but this is fairly general and you will be able to port the idea to any platform. If your jobs require any kind of locking, you may find the following useful. Also, it is best to keep reusing the same threads again and again, because thread creation is costly (see thread pooling).
Locking is a bad idea in general for "realtime" audio since it adds latency, but that concern applies to realtime jobs; for offline decoding/encoding, locks are perfectly OK, and even for realtime work you can get better performance and avoid dropping frames by applying some threading knowledge.
For audio, semaphores are a bad, bad idea. They were too slow, at least on my system (POSIX semaphores), when I tried them, but you will need them if you are thinking of cross-thread locking (not the kind of locking where you lock and unlock in the same thread). POSIX mutexes only allow self-lock and self-unlock (you have to do both in the same thread); otherwise the program might appear to work, but it is undefined behavior and should be avoided.
Lock-free atomic operations may give you enough freedom from locks to get the same functionality (such as locking) with better performance.

How to get multithreads working properly using pthreads and not boost in class using C++

I have a project that I have summarized here with some pseudo code to illustrate the problem.
I do not have a compiler issue and my code compiles well whether it be using boost or pthreads. Remember this is pseudo code designed to illustrate the problem and not directly compilable.
The problem I am having is that for a multithreaded function the memory usage and processing time is always greater than if the same function is achieved using serial programming, e.g. a for/while loop.
Here is a simplified version of the problem I am facing:
class aproject {
public:
    typedef struct
    {
        char** somedata;
        double output, fitness;
    } entity;

    entity **entity_array;
    int whichthread, numthreads;
    pthread_mutex_t mutexdata;

    aproject() {
        numthreads = 100;
        entity_array = new entity*[numthreads];
        for (int i = 0; i < numthreads; i++) {
            entity_array[i] = new entity;
            entity_array[i]->somedata = new char*[numthreads];
            entity_array[i]->somedata[i] = new char[100];
        }
        /*.....more memory allocations for entity_array.......*/
        this->initdata();
        this->eval_thread();
    }

    void initdata() {
        /**put zeros and ones in entity_array**/
    }

    float somefunc(char *somedata) {
        float output = countzero(); //some other function not listed
        return output;
    }

    void* thread_function()
    {
        pthread_mutex_lock(&mutexdata);
        int currentthread = this->whichthread;
        this->whichthread += 1;
        pthread_mutex_unlock(&mutexdata);

        entity *ent = this->entity_array[currentthread];
        double A = somefunc(ent->somedata[0]);
        double B = somefunc(ent->somedata[1]);
        double t4 = anotherfunc(A, B);
        ent->output = t4;
        ent->fitness = sqrt(pow(t4, 2));
        return NULL;
    }

    static void* staticthreadproc(void* p) {
        return reinterpret_cast<aproject*>(p)->thread_function();
    }

    void eval_thread() {
        //use multithreading to evaluate individuals in parallel
        int nthreads = this->numthreads;
        pthread_t threads[nthreads];
        //create threads
        pthread_mutex_init(&this->mutexdata, NULL);
        this->whichthread = 0;
        for (int i = 0; i < nthreads; i++) {
            pthread_create(&threads[i], NULL, &aproject::staticthreadproc, this);
            //printf("creating thread, %d\n",i);
        }
        //join threads
        for (int i = 0; i < nthreads; i++) {
            pthread_join(threads[i], NULL);
        }
    }
};
I am using pthreads here because it works better than boost on machines with less memory.
Each thread is started in eval_thread and terminated there as well. I am using a mutex to ensure every thread starts with the correct index into entity_array, as each thread only applies its work to its respective entity_array element indexed by the variable this->whichthread. This variable is the only thing that needs to be protected by the mutex, as it is updated for every thread and must not be changed by other threads. You can happily ignore everything else apart from thread_function, eval_thread, and staticthreadproc, as they are the only relevant functions; assume all the other functions apart from init are both processor and memory intensive.
So my question is: why is using multithreading in this way more costly in memory and speed than the traditional method of not using threads at all?
I MUST REITERATE THE CODE IS PSEUDO CODE AND THE PROBLEM ISNT WHETHER IT WILL COMPILE
Thanks, I would appreciate any suggestions you might have for pthreads and/or boost solutions.
Each thread requires its own call stack, which consumes memory. Every local variable of your function (and all other functions on the call stack) counts toward that memory.
When creating a new thread, space for its call-stack is reserved. I don't know what the default-value is for pthreads, but you might want to look into that. If you know you require less stack-space than is reserved by default, you might be able to reduce memory-consumption significantly by explicitly specifying the desired stack-size when spawning the thread.
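For example, with pthreads the stack size can be set per thread through a thread attribute. A sketch (the 256 KiB figure is just an illustrative value, not a recommendation):
#include <pthread.h>
#include <stddef.h>

void* worker(void* arg)
{
    /* thread body goes here */
    return NULL;
}

int spawn_small_stack_thread(pthread_t* thread)
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);

    // Default stacks are often several megabytes; if the thread only
    // needs a little, reserving less saves memory per thread.
    size_t stack_size = 256 * 1024;   // must be at least PTHREAD_STACK_MIN
    pthread_attr_setstacksize(&attr, stack_size);

    int rc = pthread_create(thread, &attr, &worker, NULL);
    pthread_attr_destroy(&attr);
    return rc;
}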
As for the performance-part - it could depend on several issues. Generally, you should not expect a performance boost from parallelizing number-crunching operations onto more threads than you have cores (don't know if that is the case here). This might end up being slower, due to the additional overhead of context-switches, increased amount of cache-misses, etc. There are ways to profile this, depending on your platform (for instance, the Visual Studio profiler can count cache-misses, and there are tools for Linux as well).
Creating a thread is quite an expensive operation. If each thread only does a very small amount of work, then your program may be dominated by the time taken to create them all. Also, a large number of active threads can increase the work needed to schedule them, degrading system performance. And, as another answer mentions, each thread requires its own stack memory, so memory usage will grow with the number of threads.
Another issue can be cache invalidation; when one thread writes its results in to its entity structure, it may invalidate nearby cached memory and force other threads to fetch data from higher-level caches or main memory.
You may have better results if you use a smaller number of threads, each processing a larger subset of the data. For a CPU-bound task like this, one thread per CPU is probably best - that means that you can keep all CPUs busy, and there's no need to waste time scheduling multiple threads on each. Make sure that the entities each thread works on are located together in the array, so it does not invalidate those cached by other threads.
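A sketch of that partitioning, using std::thread for brevity; process_entity and the simplified entity struct below are hypothetical stand-ins for the per-entity work in the question:
#include <algorithm>
#include <thread>
#include <vector>

struct entity { double output, fitness; };

// Stand-in for the real per-entity work (somefunc, anotherfunc, ...).
void process_entity(entity& e)
{
    e.output = 0;
    e.fitness = 0;
}

void process_all(std::vector<entity>& entities)
{
    unsigned num_threads = std::thread::hardware_concurrency();
    if (num_threads == 0)
        num_threads = 2;   // fallback if the value is unknown

    std::vector<std::thread> threads;
    size_t chunk = (entities.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        size_t begin = t * chunk;
        size_t end = std::min(entities.size(), begin + chunk);
        if (begin >= end)
            break;
        // Each thread owns one contiguous slice of the array, so writes
        // from different threads mostly land in different cache lines
        // and no locking is needed inside the loop.
        threads.push_back(std::thread([&entities, begin, end]() {
            for (size_t i = begin; i < end; ++i)
                process_entity(entities[i]);
        }));
    }

    for (std::thread& th : threads)
        th.join();
}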