Empty element in array-based bounded buffer - c++

I have a classic producer/consumer problem. The code for the producer is this:
#define BUFFER_SIZE 10

while (true) {
    /* Produce an item */
    while (((in + 1) % BUFFER_SIZE) == out)
        ; /* do nothing -- no free buffers */
    buffer[in] = item;
    in = (in + 1) % BUFFER_SIZE;
}
And the consumer is:
while (true) {
    while (in == out)
        ; // do nothing -- nothing to consume
    // remove an item from the buffer
    item = buffer[out];
    out = (out + 1) % BUFFER_SIZE;
    return item;
}
This works, but the problem is that once nine elements are filled, with in = 9 and out = 0, the producer spins and never fills the last (tenth) slot. The same thing happens when, say, in = 4 and out = 5. In every case one element is left empty, and the queue appears to be "full" even though one slot is still unused.
I can come up with a few complicated checks, but I want to know whether there is a clean solution that fills the whole queue. I have tried incrementing in first and then storing the item, but that runs into similar problems. (Initializing both in and out to -1 doesn't work either.)

See Wikipedia on this very topic. The simplest solutions appear to be either of:
Always keep one slot empty: if in == out, the buffer is empty; if (in + 1) % BUFFER_SIZE == out, the buffer is full
Replace out with a counter, num_unread_items; simple arithmetic then recovers out from num_unread_items and in (see the sketch after this list)
... but there are other options related to counting the number of read and write operations (either in separate variables or directly in in and out), or tracking whether the last operation was a read or a write in addition to the current in and out, which lets you disambiguate the buffer-full and buffer-empty cases.
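For instance, a minimal sketch of the counted variant, keeping both indices plus the count (a sketch only: it follows the question's fragment style, and once the producer and consumer really run concurrently, num_unread_items is written by both sides and must become an atomic or lock-protected counter):
#define BUFFER_SIZE 10

int in = 0, out = 0;
int num_unread_items = 0; /* lets all BUFFER_SIZE slots be used */

/* Producer: */
while (num_unread_items == BUFFER_SIZE)
    ; /* spin -- the buffer is genuinely full, all N slots in use */
buffer[in] = item;
in = (in + 1) % BUFFER_SIZE;
num_unread_items++; /* needs synchronization between threads */

/* Consumer: */
while (num_unread_items == 0)
    ; /* spin -- nothing to consume */
item = buffer[out];
out = (out + 1) % BUFFER_SIZE;
num_unread_items--; /* needs synchronization between threads */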

In an OS development class in college, I had an adjunct teacher who claimed it was impossible to have a software-only solution that could use all N elements in the buffer.
I proved him wrong with something I decided to call the race track solution (inspired by the fact that I like to run track).
On a race track, you are not limited to a 400-meter race; a race can consist of more than one lap. So what happens if two runners are neck and neck in a race? How do you know whether they are tied, or whether one runner has lapped the other? The answer is simple: in a race, we don't monitor a runner's position on the track; we monitor the distance each runner has traversed. Thus, when two runners are neck and neck, we can disambiguate between a tie and one runner having lapped the other.
So, our algorithm has an N-element array and manages a 2N race. We don't reset the producer's or consumer's counter to zero until it finishes its respective 2N race.
We don't allow the producer to be more than one lap ahead of the consumer, and we don't allow the consumer to get ahead of the producer.
In fact, we only have to monitor the distance between the producer and the consumer.
The code is as follows:
Item track[LAP];
int consIdx = 0;
int prodIdx = 0;

void consumer()
{
    while (true)
    {
        // Distance between producer and consumer on the 2*LAP course
        // (computed modulo 2*LAP so it survives the wraparound).
        int diff = (prodIdx - consIdx + 2 * LAP) % (2 * LAP);
        if (0 < diff) // If the consumer isn't tied with the producer
        {
            Item item = track[consIdx % LAP];    // Consume from the 1-lap track.
            consIdx = (consIdx + 1) % (2 * LAP); // Advance in the 2-lap race.
        }
    }
}

void producer()
{
    while (true)
    {
        int diff = (prodIdx - consIdx + 2 * LAP) % (2 * LAP);
        if (diff < LAP) // If prod hasn't lapped cons
        {
            track[prodIdx % LAP] = Item();       // Advance on the 1-lap track.
            prodIdx = (prodIdx + 1) % (2 * LAP); // Advance in the 2-lap race.
        }
    }
}
It's been a while since I originally solved the problem, so this is according to my best recollection. Hopefully I didn't overlook any bugs.
Hope this helps!

Related

When creating threads using lambda expressions, how to give each thread its own copy of the lambda expression?

I have been working on a program that basically uses brute force to work backward to find a way, using a given set of operations, to reach a given number. For example, given the set of operations +5, -7, *10, /3, a target number, say 100 (this example probably won't have a solution), and a maximum number of moves to solve it (say 8), it will attempt to find a sequence of these operations that reaches 100. This part works in a single thread, which I have tested in an application.
However, I wanted it to be faster, so I turned to multithreading. I worked a long time just to get the lambda function working, and after some serious debugging I realized that the solution "combo" is technically found; however, before it is tested, it gets changed. I wasn't sure how this was possible, since I had thought each thread was given its own copy of the lambda function and its variables.
In summary, the program starts by parsing the input, then passes the pieces produced by the parser as parameters into an array of operation objects (somewhat of a functor). It then uses an algorithm that generates combinations, which are executed by the operation objects. The algorithm, in simple terms, takes the number of operations, assigns each a char value (each char value corresponds to an operation), and outputs a char value. It generates all possible combinations.
That is a summary of how my program works. Everything seems to be working fine and in order other than two things. There is a second error that I have not added to the title because there is a way around it, but I am curious about alternatives; the workaround is also probably not good for my computer.
So, going back to the problem with the lambda expression passed to the thread: here is what I saw using breakpoints in the debugger. It appeared that the two threads were not generating individual combos; they were properly alternating on the first number, but alternating combos. So it would go 1111, 2211, rather than generating 1111, 2111 (these are generated as the previous paragraph describes, one char at a time, combined using a stringstream). Once they got out of the loop that filled up the combo, combos would get lost: execution would switch randomly between the two threads, and the correct combo was never tested because combinations seemed to get scrambled. I realized this must have something to do with race conditions and mutual exclusion. I thought I had avoided all of that by not changing any variables from outside the lambda expression, but it appears that both threads are using the same lambda expression.
I want to know why this occurs, and how I can, say, create an array of these expressions and assign each thread its own - or something similar that avoids having to deal with mutual exclusion altogether.
Now, the other problem happens when I at the end delete my array of operation objects. The code which assigns them and the deleting code is shown below.
operation *operations[get<0>(functions)];
for (int i = 0; i < get<0>(functions); i++)
{
    // creates a new object for each operation in the array and sets it to the corresponding parameter
    operations[i] = new operation(parameterStrings[i]);
}
delete[] operations;
The get<0>(functions) is where the number of functions is stored in a tuple; it is the number of objects to be stored in the array. parameterStrings is a vector holding the strings used as parameters for the constructor of the class. This code results in an "Exception trace/breakpoint trap." If I use "*operations" instead, I get a segmentation fault in the file where the class is defined, on the first line where it says "class operation." The alternative is just to comment out the delete part, but I am pretty sure that would be a bad idea, since the objects are created with the "new" operator and it might cause memory leaks.
Below is the code for the lambda expression and the corresponding code for the creation of the threads. I re-added the code inside the lambda expression so it can be inspected for possible causes of race conditions.
auto threadLambda = [&](int thread, char *letters, operation **operations, int beginNumber) {
    int i, entry[len];
    bool successfulComboFound = false;
    stringstream output;
    int outputNum;
    for (i = 0; i < len; i++)
    {
        entry[i] = 0;
    }
    do
    {
        for (i = 0; i < len; i++)
        {
            if (i == 0)
            {
                output << beginNumber;
            }
            char numSelect = *letters + (entry[i]);
            output << numSelect;
        }
        outputNum = stoll(output.str());
        if (outputNum == 23513511)
        {
            cout << "strange";
        }
        if (outputNum != 0)
        {
            tuple<int, bool> outputTuple;
            int previousValue = initValue;
            for (int g = 0; g <= (output.str()).length(); g++)
            {
                operation *copyOfOperation = (operations[((int)(output.str()[g])) - 49]);
                //cout << copyOfOperation->inputtedValue;
                outputTuple = (*operations)->doOperation(previousValue);
                previousValue = get<0>(outputTuple);
                if (get<1>(outputTuple) == false)
                {
                    break;
                }
                debugCheck[thread - 1] = debugCheck[thread - 1] + 1;
                if (previousValue == goalValue)
                {
                    movesToSolve = g + 1;
                    winCombo = outputNum;
                    successfulComboFound = true;
                    break;
                }
            }
            //cout << output.str() << ' ';
        }
        if (successfulComboFound == true)
        {
            break;
        }
        output.str("0");
        for (i = 0; i < len && ++entry[i] == nbletters; i++)
            entry[i] = 0;
    } while (i < len);
    if (successfulComboFound == true)
    {
        comboFoundGlobal = true;
        finishedThreads.push_back(true);
    }
    else
    {
        finishedThreads.push_back(true);
    }
};
The threads are created here:
thread *threadArray[numberOfThreads];
for (int f = 0; f < numberOfThreads; f++)
{
    threadArray[f] = new thread(threadLambda, f + 1, lettersPointer, operationsPointer, ((int)(workingBeginOperations[f])) - 48);
}
If any more of the code is needed to help solve the problem, please let me know and I will edit the post to add the code. Thanks in advance for all of your help.
Your lambda captures everything by reference ([&]), so each copy of the lambda used by a thread references the same shared objects, and so the threads race and clobber each other.
This is assuming things like movesToSolve and winCombo come from captures (it is not clear from the code, but it seems like it). winCombo is updated when a successful result is found, but another thread might immediately overwrite it right after.
So every thread is using the same data, data races abound.
You want to ensure that your lambda works only on three types of data:
Private data
Shared, constant data
Properly synchronized mutable shared data
Generally you want almost everything in categories 1 and 2, with as little as possible in category 3.
Category 1 is the easiest, since you can use, e.g., local variables within the lambda function, or captured-by-value variables if you ensure a different lambda instance is passed to each thread (as sketched below).
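For instance, a minimal sketch of keeping everything in category 1 (the combo-building work is reduced to a placeholder here; only the structure matters):
#include <sstream>
#include <thread>
#include <vector>

int main() {
    // Capture nothing by reference; pass per-thread inputs as arguments
    // and keep all scratch state local, so each thread owns its own copies.
    auto threadLambda = [](int thread, int beginNumber) {
        std::stringstream output;        // local: private to this thread
        output << beginNumber << thread; // placeholder for building combos
    };
    std::vector<std::thread> pool;
    for (int f = 0; f < 2; ++f)
        pool.emplace_back(threadLambda, f + 1, f);
    for (auto &t : pool) t.join();
}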
For category 2, you can use const to ensure the relevant data isn't modified.
Finally you may need some shared global state, e.g., to indicate that a value is found. One option would be something like a single std::atomic<Result *> where when any thread finds a result, they create a new Result object and atomically compare-and-swap it into the globally visible result pointer. Other threads check this pointer for null in their run loop to see if they should bail out early (I assume that's what you want: for all threads to finish if any thread finds a result).
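A minimal sketch of that idea, with the search reduced to a placeholder (the Result contents, ranges, and goal test are stand-ins, not your actual combo logic):
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

struct Result {
    int winningValue;
};

std::atomic<Result*> g_result{nullptr};

// Each worker searches its own private range (category 1 data) and
// publishes the first hit through one atomic pointer (category 3 data).
void worker(int begin, int end, int target) {
    for (int v = begin; v < end; ++v) {
        if (g_result.load(std::memory_order_acquire) != nullptr)
            return;                       // another thread already finished
        if (v == target) {                // stand-in for "combo reaches goal"
            Result *mine = new Result{v};
            Result *expected = nullptr;
            if (!g_result.compare_exchange_strong(expected, mine))
                delete mine;              // lost the race; keep the winner
            return;
        }
    }
}

int main() {
    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t)
        pool.emplace_back(worker, t * 250, (t + 1) * 250, 700);
    for (auto &th : pool) th.join();
    if (Result *r = g_result.load()) {
        std::cout << "found " << r->winningValue << '\n';
        delete r;
    }
}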
A more idiomatic way would be to use std::promise.
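A sketch of the std::promise variant under the same placeholder search; set_value may only be called once, so std::call_once guards against two threads both finding a hit (capturing by reference is safe here because promise and once_flag are internally synchronized):
#include <future>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::promise<int> resultPromise;
    std::future<int> resultFuture = resultPromise.get_future();
    std::once_flag found;

    auto worker = [&](int begin, int end, int target) {
        for (int v = begin; v < end; ++v)
            if (v == target) // stand-in for "combo reaches goal"
                std::call_once(found, [&] { resultPromise.set_value(v); });
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t)
        pool.emplace_back(worker, t * 250, (t + 1) * 250, 700);
    std::cout << "found " << resultFuture.get() << '\n'; // blocks until set
    for (auto &th : pool) th.join();
}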

Qt Slots called too frequently

I have a worker thread that copes with heavy, long-running computations (up to tens of seconds). These computations produce several thousand QLines, representing the edges of a dynamically growing tree.
These edges can be modified at any time, since they connect the tree's nodes according to cost, represented by distance.
I would like a smooth update of the QGraphicsScene containing the edges.
I tried with signal and slots:
The worker thread emits a signal when its buffer is full; the main thread catches it and handles the update/drawing of the lines
The signal does get caught by the main thread, but it is emitted so often that the QGraphicsView gets choked with QLines to add
Changing the size of the buffer doesn't help
Is there an alternative approach to this?
The main slot is:
void MainWindow::update_scene(bufferType buffer)
{
    for (int i = 0; i < buffer.size(); ++i)
    {
        if (buffer[i].first < (g_edges.size() - 1))
        {
            delete g_edges[buffer[i].first];
            g_edges[buffer[i].first] = scene->addLine(buffer[i].second);
        }
        else
            g_edges.push_back(scene->addLine(buffer[i].second));
    }
}
Note that bufferType is of type QList<std::pair<int,QLine>>.
Here is the heavy computing part
while (T.size() < max_nodes_number && !_stop)
{
    const cnode random_node = rand_conf();
    const cnode nearest_node = T.nearest_node(random_node);
    cnode new_node = new_conf(nearest_node, random_node);
    if (obstacle_free(nearest_node, new_node))
    {
        QList<cnode*> X_near = T.neighbours(new_node, max_neighbour_radius);
        cnode lowest_cost_node = nearest_node;
        qreal c_min = nearest_node.cost() + T.distance(nearest_node, new_node);
        for (int j = 0; j < X_near.size(); ++j)
        {
            if (obstacle_free(*X_near[j], new_node) && ((X_near[j]->cost() + T.distance(*X_near[j], new_node)) < c_min))
            {
                c_min = X_near[j]->cost() + T.distance(*X_near[j], new_node);
                lowest_cost_node = *X_near[j];
            }
        }
        T.add_node(new_node, lowest_cost_node.id());
        queue(new_node.id(), QLine(new_node.x(), new_node.y(), lowest_cost_node.x(), lowest_cost_node.y()));
        for (int j = 0; j < X_near.size(); ++j)
        {
            if (obstacle_free(*X_near[j], new_node) && (new_node.cost() + T.distance(new_node, *X_near[j])) < X_near[j]->cost())
            {
                queue(X_near[j]->id(), QLine(new_node.x(), new_node.y(), X_near[j]->x(), X_near[j]->y()));
                T.update_parent(*X_near[j], new_node.id());
                T.rewire_tree(X_near[j]->id());
            }
        }
    }
}
emit finished();
Please note that T is a class representing a tree. It provides methods for adding a node, searching for the nearest node, and so on. It has a QList<cnode> as a private member, storing the tree's nodes. cnode is a structure consisting of two coordinates, an id, a parent, a cost, and a list of its children.
The solution is, as usual, to avoid frequent queued connections, as those are quite slow. Queued connections are a coarse-grained construct and should be used as such.
Batch the work. In your scenario, you could aggregate the computed lines in a container and, only when it reaches a certain threshold, pass that container to the main thread to draw/update the lines. The threshold can be a count, a time interval, or a combination of both; you don't want to withhold updates when there are only a few results. You will need to expand your design to split the while loop so it runs in the thread's event loop instead of blocking, which lets you aggregate and pass updates periodically - something similar to this. This is always a good idea for workers that take time: you can monitor progress, cancel, pause and all sorts of handy stuff.
Those 2 lines look fishy:
edges.removeAt(i);
edges.insert (i, scene->addLine (l));
You remove and then insert - an invitation for potentially costly reallocation, and even without reallocation there is unnecessary copying involved. Instead of removing and inserting, you can simply replace the element at that index.
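For example (assuming edges holds the QGraphicsLineItem pointers returned by scene->addLine, as g_edges does in the slot above; deleting a QGraphicsItem also removes it from its scene):
delete edges[i];              // drop the old line item
edges[i] = scene->addLine(l); // replace in place - no removeAt/insert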
In your case you might omit splitting the actual while loop. Just don't emit in the loop; instead, do something like this (pseudocode):
while (...) {
    ...
    queue(new line)
    ...
    queue(update line)
    ...
}
queue(final flush)

void queue(stuff) {
    stuffBuffer.append(stuff)
    if (stuffBuffer.size() > 50 || final_flush) {
        emit do_stuff(stuffBuffer) // pass by copy
        stuffBuffer.clear() // COW will only clear stuffBuffer, not the copy passed above
    }
}
Or if it will make you feel better:
copy = stuffBuffer
stuffBuffer.clear()
emit do_stuff(copy)
This way the two containers are detached from the shared data before the copy is emitted.
EDIT: After a long discussion I ended up proposing a number of changes to the design to improve performance (the queued connections were only one aspect of the performance problem):
alleviate the graphics scene - find a compromise between the "one item per line" and "one item for all lines" solutions, where each item handles the drawing of the lines of its direct children, balancing the CPU time for adding items to the scene against redrawing items on data changes.
disable automatic scene updates, and instead control scene updates explicitly, so the scene is not updated for each and every tiny change (see the sketch after this list).
aggregate view commands in batches and submit the work buffer at a fixed interval to avoid queued signals overhead.
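For the second point, a minimal sketch of taking manual control of repaints (the QGraphicsView and QTimer calls are standard Qt API, but the fixed ~30 fps rate and the overall setup are assumptions, not taken from the question):
#include <QApplication>
#include <QGraphicsScene>
#include <QGraphicsView>
#include <QTimer>

int main(int argc, char **argv) {
    QApplication app(argc, argv);
    QGraphicsScene scene;
    QGraphicsView view(&scene);

    // Don't repaint on every tiny scene change...
    view.setViewportUpdateMode(QGraphicsView::NoViewportUpdate);

    // ...instead repaint explicitly at a fixed rate, no matter how
    // fast the worker adds or moves lines.
    QTimer repaint;
    QObject::connect(&repaint, &QTimer::timeout,
                     view.viewport(), QOverload<>::of(&QWidget::update));
    repaint.start(33); // ~30 fps

    view.show();
    return app.exec();
}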

openMP slows down when passing from 2 to 4 threads doing binary searches in a custom container

I'm currently having a problem parallelizing a C++ program using OpenMP. I am implementing a recommendation system with a user-based collaborative filtering method. To do that, I implemented a sparse_matrix class as a dictionary of dictionaries (in the sense of a Python dictionary). In my case, since insertion is only done at the beginning of the algorithm, when data is read from file, I implemented the dictionary as a standard-library vector of (key, value) pair objects, with a flag that indicates whether the vector is sorted. If the vector is sorted, a key is looked up with a binary search; otherwise the vector is sorted first and then searched. Alternatively, the dictionary's entries can be scanned linearly, for example in loops over all the keys of the dictionary. The relevant portion of the code that is causing problems is the following:
void compute_predicted_ratings_omp (sparse_matrix &targets,
                                    sparse_matrix &user_item_rating_matrix,
                                    sparse_matrix &similarity_matrix,
                                    int k_neighbors)
{
    // Auxiliary private variables
    int user, item;
    double predicted_rating;
    dictionary<int,double> target_vector, item_rating_vector, item_similarity_vector;
    #pragma omp parallel shared(targets, user_item_rating_matrix, similarity_matrix) \
        private(user, item, predicted_rating, target_vector, item_rating_vector, item_similarity_vector)
    {
        if (omp_get_thread_num() == 0)
            std::cout << " - parallelized on " << omp_get_num_threads() << " threads: " << std::endl;
        #pragma omp for schedule(dynamic, 1)
        for (size_t iter_row = 0; iter_row < targets.nb_of_rows(); ++iter_row)
        {
            // Retrieve target user
            user = targets.row(iter_row).get_key();
            // Retrieve the user rating vector.
            item_rating_vector = user_item_rating_matrix[user];
            for (size_t iter_col = 0; iter_col < targets.row(iter_row).value().size(); ++iter_col)
            {
                // Retrieve target item
                item = targets.row(iter_row).value().entry(iter_col).get_key();
                // Retrieve the similarity vector associated to the target item
                item_similarity_vector = similarity_matrix[item];
                // Compute predicted rating
                predicted_rating = predict_rating(item_rating_vector,
                                                  item_similarity_vector,
                                                  k_neighbors);
                // Set result in targets
                targets.row(iter_row).value().entry(iter_col).set_value(predicted_rating);
            }
        }
    }
}
In this function I compute the predicted rating for a series of target (user, item) pairs (this is simply a weighted average). To do that, I run an outer loop over the target users (which are on the rows of the targets sparse matrix) and retrieve the rating vector for the current user by performing a binary search on the rows of user_item_rating_matrix. Then, for each column in the current row (i.e. for each item), I retrieve another vector, associated with the current item, from the sparse matrix similarity_matrix. With these two vectors, I compute the prediction as a weighted average of their elements (over the subset of items the two vectors have in common).
My problem is the following: I want to parallelize the outer loop using OpenMP. In the serial version, this function takes around 3 seconds. With OpenMP on 2 threads, it takes around 2 seconds (which is not bad, since I still have some work imbalance in the outer loop). When using 4 threads, it takes 7 seconds. I cannot understand the cause of this slowdown. Do you have any idea?
I have already thought about the problem and I share my considerations with you:
I access the sparse matrices only in read mode. Since the matrices are pre-sorted, the binary searches do not modify them, so no race conditions should arise.
Several threads could access the same vector of the sparse matrix at the same time. I have read something about false sharing, but since I do not write to these vectors, I don't think this is the reason for the slowdown.
The parallel version seems to work fine with two threads (even if the speedup is lower than expected).
No problem is observed with 4 threads for other choices of the parameters. In particular (cf. "Further details on the predict_rating function" below), when I consider all the similar items for the weighted average and I scan the rating vector while searching in the similarity vector (the opposite of what I normally do), the execution time scales well on 4 threads.
Further details on the predict_rating function: it works in the following way. The smaller of item_rating_vector and item_similarity_vector is scanned linearly, with a binary search into the longer of the two. If the rating/similarity is positive, it is included in the weighted average.
double predict_rating (dictionary<int, double> &item_rating_vector,
                       dictionary<int, double> &item_similarity_vector)
{
    size_t size_item_rating_vector = item_rating_vector.size();
    size_t size_item_similarity_vector = item_similarity_vector.size();
    if (size_item_rating_vector == 0 || size_item_similarity_vector == 0)
        return 0.0;
    else
    {
        double s, r, sum_s = 0.0, sum_sr = 0.0;
        int temp_item = 0;
        if (size_item_rating_vector < size_item_similarity_vector)
        {
            // Scan item_rating_vector and search in item_similarity_vector
            for (dictionary<int,double>::const_iterator iter = item_rating_vector.begin();
                 iter != item_rating_vector.end();
                 ++iter)
            {
                // scan the rating vector forwards: iterate until the whole vector has
                // been scanned.
                temp_item = (*iter).get_key();
                // Retrieve the similarity between temp_item and the target item (0.0 if not found)
                try { s = item_similarity_vector[temp_item]; }
                catch (const std::out_of_range &e) { s = 0.0; }
                if (s > 0.0)
                {
                    // temp_item is positively similar to the target item: consider it in the average
                    // Retrieve the rating that the user gave to temp_item
                    r = (*iter).get_value();
                    // increment the sums
                    sum_s += s;
                    sum_sr += s * r;
                }
            }
        }
        else
        {
            // Scan item_similarity_vector and search in item_rating_vector
            for (dictionary<int,double>::const_iterator iter = item_similarity_vector.begin();
                 iter != item_similarity_vector.end();
                 ++iter)
            {
                // scan the similarity vector forwards: iterate until the whole vector has
                // been scanned.
                temp_item = (*iter).get_key();
                s = (*iter).get_value();
                if (!(s > 0.0))
                    continue;
                // Retrieve the rating that the user gave to temp_item (0.0 if not given)
                try { r = item_rating_vector[temp_item]; }
                catch (const std::out_of_range &e) { r = 0.0; }
                if (r > 0.0)
                {
                    // temp_item is positively similar to the target item: increment the sums
                    sum_s += s;
                    sum_sr += s * r;
                }
            }
        }
        if (sum_s > 0.0)
            return sum_sr / sum_s;
        else
            return 0.0;
    }
}
Further details on the hardware: I am running this program on a Dell XPS 15 with a quad-core i7 processor and 16 GB of RAM. I execute the code in a Linux VirtualBox VM (set to use 4 processors and 4 GB of RAM).
Thanks in advance,
Pierpaolo
It appears you might have a false-sharing problem with your targets variable. False sharing occurs when different threads frequently write to locations near each other (on the same cache line). By explicitly setting the schedule to dynamic with a chunk size of 1, you are telling OpenMP to have each thread take tasks one element at a time, allowing different threads to work on data that may be near each other in targets.
I would recommend removing the schedule directive just to see how the default scheduler and chunk size do. Then I would try both static and dynamic schedules while varying the chunk size substantially. If your workload or hardware platform is unbalanced, dynamic will probably win, but I would still try static.
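For instance (only the pragma changes, the loop body stays as in the question; the chunk sizes here are arbitrary starting points, not tuned values):
// Default: let the runtime pick the schedule and chunk size
//     #pragma omp for
// Static split: lowest overhead when rows are similar amounts of work
//     #pragma omp for schedule(static)
// Dynamic with larger chunks: threads still balance, but each owns a
// contiguous block of rows in targets, reducing false sharing
#pragma omp for schedule(dynamic, 64)
for (size_t iter_row = 0; iter_row < targets.nb_of_rows(); ++iter_row)
{
    // ... body unchanged ...
}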
Well, I found the solution to the problem myself; I am posting the explanation for the community. In the predict_rating function I used try/catch to handle the out_of_range errors thrown by my dictionary structure when a key that is not contained in the dictionary is searched. I read on Are exceptions in C++ really slow that exception handling is computationally heavy when an exception is actually thrown. In my case, each call to predict_rating had multiple out_of_range errors thrown and handled. I simply removed the try/catch block and wrote a function that searches the dictionary and returns a default value if the key does not exist. This modification produced a speedup of around 2000x, and now the program scales well with the number of threads, even on the VM.
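A minimal sketch of such a lookup, assuming the dictionary stores sorted (key, value) pairs in a vector as described in the question (the function name and signature are illustrative, not the original code):
#include <algorithm>
#include <utility>
#include <vector>

// Binary search on a sorted vector of (key, value) pairs; returns
// default_value instead of throwing when the key is absent.
double find_or_default(const std::vector<std::pair<int, double>> &dict,
                       int key, double default_value = 0.0)
{
    auto it = std::lower_bound(
        dict.begin(), dict.end(), key,
        [](const std::pair<int, double> &entry, int k) { return entry.first < k; });
    return (it != dict.end() && it->first == key) ? it->second : default_value;
}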
Thanks to all of you and if you have other suggestions don't hesitate!
Pierpaolo

implement striping algorithm C++

Hi, I am having trouble implementing a striping algorithm. I also have a problem loading 30,000 records into one vector; I tried this, but it is not working.
The program should declare variables to store ONE RECORD at a time. It should read a record and process it, then read another record, and so on. Each process should ignore records that "belong" to another process. This can be done by keeping track of the record count and determining whether the current record should be processed or ignored. For example, if there are 4 processes (numProcs = 4), process 0 should work on records 0, 4, 8, 12, ... (assuming we count from 0) and ignore all the other records in between.
Residence res;
int numProcs = 4;
int linesNum = 0;
int recCount = 0;
int count = 0;
while (count <= numProcs)
{
    while (!residenceFile.eof())
    {
        ++recCount;
        //distancess.push_back(populate_distancesVector(res,foodbankData));
        if (recCount % processIS == linesNum)
        {
            residenceFile >> res.x >> res.y;
            distancess.push_back(populate_distancesVector(res,foodbankData));
        }
        ++linesNum;
    }
    ++count;
}
Updated code:
Residence res;
int numProcs = 1;
int recCount = 0;
while (!residenceFile.eof())
{
    residenceFile >> res.x >> res.y;
    //distancess.push_back(populate_distancesVector(res,foodbankData));
    if (recCount == processId) //process id
    {
        distancess.push_back(populate_distancesVector(res,foodbankData));
    }
    ++recCount;
    if (recCount == processId)
        recCount = 0;
}
Updated pseudocode:
while (!residenceFile.eof())
{
    residenceFile >> res.x >> res.y;
    if (recCount % numProcs == numLines)
    {
        distancess.push_back(populate_distancesVector(res,foodbankData));
    }
    else
        ++numLines
    ++recCount
}
You have tagged your post with MPI, but I don't see any place where you are checking a processor ID to see which record it should process.
Pseudocode for a solution to what I think you're asking:
while (there are more records) {
    if (record count % numProcs == myID)
        process record
    else
        increment file stream pointer forward one record without processing
    increment record count
}
If you know the number of records you will be processing beforehand, then you can come up with a cleverer solution that moves the file-stream pointer ahead by numProcs records until that number is reached or surpassed.
A process that will act on records 0 and 4 must still read records 1, 2 and 3 (in order to get to 4).
Also, while(!residenceFile.eof()) isn't a good way to iterate through a file; it will read one round past the end. Do something like while(residenceFile >> res.x >>res.y) instead.
As for making a vector that contains 30,000 records, it sounds like a memory limitation. Are you sure you need that many in memory at once?
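Putting both points together, a sketch of the striped read loop (processId and numProcs are assumed to come from your MPI rank and size):
Residence res;
int recCount = 0;
while (residenceFile >> res.x >> res.y) // stops cleanly at end of file
{
    // every process reads every record, but only acts on its own stripe
    if (recCount % numProcs == processId)
        distancess.push_back(populate_distancesVector(res, foodbankData));
    ++recCount;
}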
EDIT:
Look carefully at the updated code. If the process ID (processId) is zero, the process will act on the first record and no other; if it is anything else, it will act on none of them.
EDIT:
Alas, I do not know Arabic. I will try to explain clearly in English.
You must learn a simple technique before you attempt a difficult one. If you guess at the algorithm, you will fail.
First, write a loop that iterates {0,1,2,3,...} and prints out all of the numbers:
int i = 0;
while (i < 10)
{
    cout << i << endl;
    ++i;
}
Understand this before going farther. Then write a loop that iterates the same way, but prints out only {0,4,8,...}:
int i = 0;
while (i < 10)
{
    if (i % 4 == 0)
        cout << i << endl;
    ++i;
}
Understand this before going farther. Then write a loop that prints out only {1,5,9,...}. Then write a loop that reads the file, and reports on every record. Then combine that with the logic from the previous exercise, and report on only one record out of every four.
Start with something small and simple. Add complexity in small measures. Develop new techniques in isolation. Test every step. Never add to code that doesn't work. This is the way to write code that works.

randomly choosing an empty vector element, when it is possible to know beforehand which are full

I finally determined that this function is responsible for the majority of my bottleneck issues. I think it's because of the massively excessive random access that happens when most of the synapses are already active. Basically, as the title says, I need to optimize the algorithm somehow so that I'm not randomly checking a ton of active elements before landing on one of the few that are left.
Also, I included the whole function in case of other flaws that can be spotted.
void NetClass::Explore(vector<synapse> &synapses, int &n_syns) //add new synapses
{
    int size = synapses.size();
    assert(n_syns <= size);
    //Increase the age of each active synapse by 1
    Age_Increment(synapses);
    //make sure there is at least one inactive synapse left
    if (n_syns == size)
        return;
    //stochastically decide whether a new connection is added
    if ((rand_r(seedp) % 1000) < (x / (1 + (n_syns * (y / 100)))))
    {
        n_syns++; //a new synapse has been created
        //main inefficiency here
        while (1)
        {
            int syn = rand_r(seedp) % size;
            if (!synapses[syn].active)
            {
                synapses[syn].active = true;
                synapses[syn].weight = .04 + (float(rand_r(seedp) % 17) / 100);
                break;
            }
        }
    }
}
void NetClass::Age_Increment(vector<synapse> &synapses)
{
    for (int q = 0, size = synapses.size(); q < size; q++)
        if (synapses[q].active)
            synapses[q].age++;
}
Pass a random number, k, in the range [0, size-n_syns) to Age_Increment. Have Age_Increment return the kth empty slot.
Since you're already traversing the whole list in Age_Increment, update that function to return the list of the indexes of inactive synapses.
You can then pick a random item from that list directly (see the sketch below).
This is similar to the problem of finding free blocks in memory management, so I would take a look at algorithms used in that domain, specifically free lists, which is a list of free positions. (These are usually implemented as linked lists to be able to pop elements off an end efficiently. Random access in a linked list would still be O(n) - with a smaller n, but still not the best choice for your use case.)
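A minimal sketch of that second suggestion, reusing the existing Age_Increment pass (the synapse members follow the question's code; rand_r and seedp are assumed available as they are there):
#include <vector>

struct synapse { bool active = false; int age = 0; float weight = 0.0f; };

// One pass: age the active synapses and collect the inactive slots.
std::vector<int> Age_Increment(std::vector<synapse> &synapses)
{
    std::vector<int> inactive;
    for (int q = 0, size = (int)synapses.size(); q < size; ++q) {
        if (synapses[q].active)
            synapses[q].age++;
        else
            inactive.push_back(q);
    }
    return inactive;
}

// Usage inside Explore: pick directly from the inactive slots,
// no retry loop needed.
//   std::vector<int> inactive = Age_Increment(synapses);
//   if (!inactive.empty()) {
//       int syn = inactive[rand_r(seedp) % inactive.size()];
//       synapses[syn].active = true;
//   }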