iPhone: NSOperationQueue running operations serially - concurrency

I have a singleton NSOperationQueue that handles all of my network requests. I'm noticing, however, that when one particularly long operation is running (this operation takes at least 25 seconds), my other operations don't run until it completes.
maxConcurrentOperationCount is set to NSOperationQueueDefaultMaxConcurrentOperationCount, so I don't believe that's the issue.
Any reason why this would be happening? Besides spawning multiple NSOperationQueues (a solution that I'm not sure would work, nor am I sure it's a good idea), what's the best way to fix this problem?
Thanks.

According to the NSOperationQueue class reference, NSOperationQueueDefaultMaxConcurrentOperationCount means that "the default maximum number of operations is determined dynamically by the NSOperationQueue object based on current system conditions." I don't see anything that says the default will be > 1 (especially on a single-CPU system).
Try calling -setMaxConcurrentOperationCount: to explicitly set a larger value, and see if that helps.

Unable to fix score trap issue in Optaplanner for a variation of the Task Scheduling Problem

I am working on a variation of the Task scheduling problem. Here are the rules:
Each Task has a start time to be chosen by the optimizer
Each Task requires multiple types of resources (Crew Members) who work in parallel to complete the task, i.e. the task can start only when all required types of crew members are available.
There are multiple crew members of each type, and the optimizer has to choose the crew member of each type for a task. E.g., Task A requires an electrician and a plumber, and there are many electricians and plumbers to choose from.
Here is my domain model.
I have created a planning entity called TaskAssignment with 2 planning variables, CrewMember and Starttime.
So, if a Task requires 3 types of crew members, then 3 TaskAssignment entities are associated with it.
I placed a hard constraint forcing the Starttime planning variable to be the same for all the TaskAssignments corresponding to a particular task.
This works perfectly when I do not add any soft constraints (for example, to reduce the total cost of using the resources). But when I add the soft constraint, there is a violation of 1 hard constraint.
My guess is that this is due to a score trap, because the start times are not changing as a set.
Note: I have tried to avoid using a planning list variable. Can anyone suggest a way to solve this issue?
Your issue appears to be that your scoring function expects all employees on a given task to move simultaneously, but the solver actually moves them one by one. The problem is in your domain model: it allows this situation to happen, because you only ever assign one employee at a time.
There are two ways of fixing this problem:
Fix your model so that this is not allowed. For example, if you know that each task requires two people, put two variables on the entity, one for each employee. If there is a certain maximum number of people per task, have a variable for each, and make them nullable, so that unassigned slots are not an issue. If you don't have a fixed number of employees per task and cannot settle on a reasonable maximum, then this approach will likely not work for you. In that case...
Write coarse-grained custom moves which always move all the employees together.

How to implement atomic reference counter that does not overflow?

I was thinking about reference counting based on atomic integers that would be safe from overflow. How to do it?
Please let's not focus on whether such overflow is a realistic problem or not. The task itself got my interest even if not practically important.
Example
An implementation of reference counting is shown as an example in the Boost.Atomic documentation. Based on that example, we can extract the following sample code:
struct T
{
    boost::atomic<boost::uintmax_t> counter;
};

void add_reference(T* ptr)
{
    ptr->counter.fetch_add(1, boost::memory_order_relaxed);
}

void release_reference(T* ptr)
{
    if (ptr->counter.fetch_sub(1, boost::memory_order_release) == 1) {
        boost::atomic_thread_fence(boost::memory_order_acquire);
        delete ptr;
    }
}
In addition, the following explanation is given:
Increasing the reference counter can always be done with memory_order_relaxed: New references to an object can only be formed from an existing reference, and passing an existing reference from one thread to another must already provide any required synchronization.
It is important to enforce any possible access to the object in one thread (through an existing reference) to happen before deleting the object in a different thread. This is achieved by a "release" operation after dropping a reference (any access to the object through this reference must obviously have happened before), and an "acquire" operation before deleting the object.
It would be possible to use memory_order_acq_rel for the fetch_sub operation, but this results in unneeded "acquire" operations when the reference counter does not yet reach zero and may impose a performance penalty.
EDIT >>>
It seems that the Boost.Atomic documentation might be wrong here. The acq_rel ordering might be needed after all.
At least that is how boost::shared_ptr is implemented when done using std::atomic (there are other implementations as well). See the file boost/smart_ptr/detail/sp_counted_base_std_atomic.hpp.
Herb Sutter also mentions it in his lecture C++ and Beyond 2012: Herb Sutter - atomic<> Weapons, 2 of 2 (the reference counting part starts at 1:19:51). He also seems to discourage the use of standalone fences in that talk.
Thanks to user 2501 for pointing that out in the comments below.
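For reference, a fence-free release along those lines could look as follows. This is just a sketch of the variant described above, not code from the Boost documentation:
void release_reference(T* ptr)
{
    // acq_rel on the decrement replaces the release ordering plus
    // the separate acquire fence used in the sample above
    if (ptr->counter.fetch_sub(1, boost::memory_order_acq_rel) == 1) {
        delete ptr;
    }
}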
<<< END EDIT
Initial attempts
Now the problem is that add_reference as written could (at some point) overflow, and it would do so silently. That could obviously lead to trouble: once the counter silently wraps around to 0, a subsequent add_reference brings it back to 1, and the matched release_reference would then prematurely destroy the object.
I was thinking how to make add_reference detect overflow and fail gracefully without risking anything.
Comparing against 0 after fetch_add returns will not do: between the wrap and our corrective action, some other thread could call add_reference again (bringing the count to 1) and then release_reference (in effect erroneously destroying the object).
Checking first (with load) will not help either: some other thread could add its own reference between our calls to load and fetch_add.
Is this the solution?
Then I thought that maybe we could start with load, provided that we then use compare_exchange.
So first we do load and obtain a local value. If it is std::numeric_limits<boost::uintmax_t>::max(), then we fail: add_reference cannot add another reference, as all possible ones are already taken.
Otherwise we make another local value, which is the previous local reference count plus 1.
And now we do compare_exchange, providing as the expected value the original local reference count (this ensures that no other thread modified the reference count in the meantime) and as the desired value the incremented local reference count.
Since compare_exchange can fail, we have to do this (including the load) in a loop, until it succeeds or the max value is detected.
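Put together, the loop could look like the sketch below (same T as in the sample above, plus <limits>). try_add_reference is an illustrative name, and the relaxed ordering merely mirrors the original add_reference; whether it suffices is exactly what the questions below ask:
bool try_add_reference(T* ptr)
{
    boost::uintmax_t current = ptr->counter.load(boost::memory_order_relaxed);
    while (current != std::numeric_limits<boost::uintmax_t>::max()) {
        // On failure, compare_exchange_weak reloads `current` with the
        // latest counter value, so we retry until we succeed or saturate.
        if (ptr->counter.compare_exchange_weak(current, current + 1,
                                               boost::memory_order_relaxed)) {
            return true;   // reference successfully added
        }
    }
    return false;   // all possible references are already taken
}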
Some questions
Is such solution correct?
What memory ordering would be required to make it valid?
Which compare_exchange should be used? _weak or _strong?
Would it affect release_reference function?
Is it used in practice?
The solution is correct, though it could be improved in one respect. Currently, if the value reaches the max on the local CPU, it may already have been decreased by another CPU while the current CPU still caches the old value. It would be worth doing a dummy compare_exchange with the same expected and new value to confirm the max is really still there, and only then throw an exception (or whatever you want).
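A sketch of that dummy compare_exchange: the saturated value is both the expected and the desired value, so success proves the max is really current without modifying anything.
boost::uintmax_t maxv = std::numeric_limits<boost::uintmax_t>::max();
if (ptr->counter.compare_exchange_strong(maxv, maxv,
                                         boost::memory_order_relaxed)) {
    // the counter really is saturated right now; fail loudly
    throw std::runtime_error("reference count saturated");
}
// otherwise maxv now holds the fresh value and the loop can retry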
For the rest:
It doesn't matter whether you use _weak or _strong, as it will run in a loop anyway, so the next load will quite reliably get the latest value.
As for add_reference and release_reference: who would then check whether the reference was really added or not? Would it throw an exception? If so, it would probably work. But generally it's better for such low-level operations to be unable to fail; rather, use uintptr_t for the reference counter, so it can never overflow, as it is big enough to cover the address space and therefore any number of objects existing at the same time.
No, it's not used in practice, for the above reasons.
Quick math: say uint is 32 bits, so max uint is 4G (about 4 billion). Each reference/pointer is at least 4 bytes (8 if you are on a 64-bit system), so to overflow you would need 16 GB of memory dedicated to storing references pointing to the same object, which would point to a serious design flaw.
I would say it's not a problem today, nor in the foreseeable future.
This question is moot. Even assuming an atomic increment takes 1 CPU cycle (it does not!), on a 4 GHz CPU it would take almost 150 years to wrap around a 64-bit integer (2^64 / 4×10^9 ≈ 4.6×10^9 seconds), provided the CPU does nothing but keep incrementing.
Taking into account the realities of an actual program, I find it hard to believe this is a real issue that could pester you.

C++ program stability after millions of executions

I have a program in C++ that performs mainly matrix multiplications, additions and so on.
The problem is that an EXC_BAD_ACCESS occurs after the calculation has run about 3 million times.
Are there any problems that can arise when a program is executed millions of times over several hours?
Details of the program:
The program simply performs calculations on different ranges of values, so it executes on 6 threads at the same time. There is no resource sharing between the threads.
There seems to be no evident problem in the program, since:
there is no memory leak; I've confirmed this using Instruments, and the memory footprint of the program is stable.
the program can execute at least 2 million times on each thread without any problem, but it is almost guaranteed that the EXC_BAD_ACCESS exception arises at some point, on some thread (the exception happened in both of my 2 runs of the program).
About the matrix multiplication:
Sometimes the sizes of the matrices are about 2×2 multiplied by 2×1000.
The elements of the matrices are instances of a custom complex number class.
the values of the elements are randomly generated by rand() and converted to float.
the structure is like this:
class Complex
{
private:
    float _real, _imag;
public:
    // getters, setters and overloaded operators
};

class Matrix
{
private:
    Complex **_values;
    int _row, _col;
public:
    // getters, setters and overloaded operators
};
Thank you very much!
Any ideas about possible reasons for the crash are greatly welcome!
EXC_BAD_ACCESS means that you dereferenced a pointer which doesn't point into your process's current memory space. This is a bug in your code. Run it under a debugger until it fails and then have a look at the variable values in the statement where it fails. It could be simple or exceedingly subtle.
There's too little information in your post to make a decisive answer. However, it may be that no information available to you now would change that, and you need to debug the case more carefully. Here's what I'd do.
To debug, you want repeatability. But… you say that you're using random numbers. It seems, though, that your program does some scientific-ish computations. In most cases you don't actually need “true” randomness, but “repeatable” randomness: randomness which passes statistical tests, but where you have enough data to reset the random number generator so that it will produce exactly the same results as in a previous run. For that, you can just write down the current RNG state (e.g. the seed) every time you start a new block of computation.
Now, write some code that stores all the state necessary to restart the computations (including the RNG) once every few minutes, and run the program. This way, if your code crashes, you will be able to restart the computations from exactly the same state and get to the point where it crashed without waiting for millions of iterations. I am making a strong assumption here: that except for the RNG, your code does not depend on any other kind of external state (like network activity, IO, or the process scheduler making particular choices when scheduling your threads…).
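A minimal sketch of the seed bookkeeping, under my assumption that a std::mt19937 per thread is an acceptable replacement for rand() (as a side benefit, per-thread engines avoid sharing rand()'s hidden global state across your 6 threads):
#include <cstdio>
#include <random>

void run_block(unsigned seed)
{
    // Record the seed first: a crashing block can then be replayed
    // exactly, without waiting for millions of iterations again.
    std::fprintf(stderr, "starting block with seed %u\n", seed);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    float value = dist(rng);   // instead of rand() converted to float
    // ... the actual matrix computations for this block go here ...
    (void)value;
}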
With this kind of data it will be easier to test whether the problem is due to a machine fault (overheating, bad memory, etc.). Simply restart the computation from the last state before the crash, preferably after letting the machine cool down, and maybe after rebooting it… If you encounter another crash (and it happens every time you restart the code), it's quite certain it's due to a bug in your code.
If not, we still cannot say that it's a machine fault: your code might (by pure accident/mistake in the code) crash due to undefined behavior which depends on factors out of your control. Examples include using an uninitialized pointer in a rarely-taken code path: it might throw a bad access sometimes, and go unnoticed if by pure luck the pointer points to memory you allocated. Try valgrind; it is probably the best tool to check for memory problems… except that it slows down execution so much that you'll again prefer to rerun the computations from a state known to be suspicious (the last state before the crash) instead of waiting for millions of iterations. I've seen slowdowns of 5x to 100x.
In the meantime, try running your code on another machine. If you also get crashes after a similar number of iterations (to be sure, wait for at least 3 times more iterations than it took to crash on the original machine), then it's quite probable that it's a bug in your code.
Happy hacking!
Calculations with finite precision that fail after a few million iterations? That could be accumulated round-off error. The problem is, those errors usually exhibit themselves as division by zero or other mathematical errors, and EXC_BAD_ACCESS is not one of them. However, there's one case in which it can happen: when you use the mathematical result as an array index.
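A contrived sketch of that failure mode (not the asker's code): in float, a million additions of 0.1f do not sum to 100000, and the drifted result, used as an index, walks off the end of an array.
#include <cstdio>

int main()
{
    float sum = 0.0f;
    for (int i = 0; i < 1000000; ++i) {
        sum += 0.1f;   // each addition rounds; the error accumulates
    }
    int idx = static_cast<int>(sum);
    // Prints a value noticeably larger than 100000; using it to index
    // an array of 100000 elements would be an EXC_BAD_ACCESS.
    std::printf("idx = %d\n", idx);
    return 0;
}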

What is faster in CUDA: global memory write + __threadfence() or atomicExch() to global memory?

Assuming that we have lots of threads that will access global memory sequentially, which option performs faster overall? I'm in doubt because __threadfence() takes into account all shared and global memory writes, but the writes are coalesced. On the other hand, atomicExch() takes into account just the important memory addresses, but I don't know whether its writes are coalesced or not.
In code:
array[threadIdx.x] = value;
Or
atomicExch(&array[threadIdx.x], value);
Thanks.
On Kepler GPUs, I would bet on atomicExch, since atomics are very fast on Kepler. On Fermi it may be a wash, but given that you have no collisions, atomicExch could still perform well.
Please make an experiment and report the results.
Those two do very different things.
atomicExch ensures that no two threads try to modify a given cell at the same time. If such a conflict occurs, one or more threads may be stalled. If you know beforehand that no two threads access the same cell, there is no point in using any atomic... function.
__threadfence() delays the current thread (and only the current thread!) to ensure that any subsequent writes by that thread actually happen later.
As such, __threadfence() on its own, without any follow-up code, is not very interesting.
For that reason, I don't think there is a point in comparing the efficiency of those two. Maybe if you could show a slightly more concrete use case I could relate...
Note that neither of these actually gives you any guarantees on the actual order of execution of the threads.
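For what it's worth, the usual situation where __threadfence() becomes interesting is publishing data through a flag, along these lines (a sketch with illustrative names, not the asker's kernel):
__global__ void publish(int *data, int *flag, int value)
{
    if (threadIdx.x == 0) {
        data[0] = value;       // plain global write
        __threadfence();       // make the write visible device-wide...
        atomicExch(flag, 1);   // ...before the flag announces it
    }
}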

How to write a test case for ensuring thread-safe

I wrote a thread-safe (at least that is the aim) container class in C++. I lock a mutex while accessing the members and release it when finished.
Now I am trying to write a test case to check whether it is really thread-safe.
Let's say I have Container container and two threads, Thread1 and Thread2.
Container container;

Thread1()
{
    // Add N items to the container
}

Thread2()
{
    // Add N items to the container
}
This works with no problem with N=1000.
But I'm not sure whether this regression test is enough. Is there a deterministic way to test a class like that?
Thanks.
There is no real way to write a test that proves it's safe.
You can only design it so it is safe and test that your design is implemented correctly. The best you can do is stress-test it.
I guess that you wrote a generic container and that you want to verify that two different threads cannot insert items at the same time.
If my assumptions are correct, then my proposition would be to write a custom element class in which you overload the copy constructor, inserting a sleep that can be parameterized.
To test your container, create an instance of it for your custom class. Then, in the first thread, insert an instance of the custom class with a long sleep; meanwhile, start the second thread, which tries to insert an instance with a short sleep. If the second insertion comes back before the first one, you know that the test failed.
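A sketch of such an element class (the names and delays are illustrative):
#include <chrono>
#include <thread>

// An element whose copy constructor sleeps, widening the window in
// which two insertions can overlap if the container's locking is broken.
struct SlowItem {
    int value;
    std::chrono::milliseconds delay;

    SlowItem(int v, std::chrono::milliseconds d) : value(v), delay(d) {}

    SlowItem(const SlowItem& other) : value(other.value), delay(other.delay) {
        std::this_thread::sleep_for(delay);   // stall inside the insertion
    }
};

// Thread 1 inserts SlowItem(1, std::chrono::milliseconds(500)),
// thread 2 inserts SlowItem(2, std::chrono::milliseconds(10));
// if the second insertion returns first, the test has failed.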
That's a reasonable starting point, though I'd make a few suggestions:
Run the test on a quad-core machine to improve the odds of real resource contention.
Instead of having a fixed number of threads, I'd suggest spawning a random number of threads with a lower bound equal to the number of processors on the test machine and an upper bound that's four times that number.
Consider doing occasional runs with a substantially larger number of items (say 100,000).
Run your tests on optimized, release (non-debug) builds.
If you're targeting Windows, you may want to consider using critical sections rather than mutexes as they're generally more performant.
Proving that it's safe is not possible, but to improve the stress test's chances of finding bugs, you can modify the container's add method so that it looks like this:
// Assuming in_use_flag is atomic (e.g. std::atomic<bool>),
// so that this check-and-set cannot itself race
if (in_use_flag.exchange(true)) {
    assert(!"two threads entered add() at the same time");  // error!
}
// ... original add method code ...
sleep(long_time);   // widen the window for the other thread to collide
in_use_flag = false;
This way you can almost make sure that the two threads try to access the container at the same time, and also check for such occurrences, thus making sure the thread-safety actually works.
P.S. I would also remove the mutex protection once, just to see the test fail.