C++11: Atomic variable: lock_free property: What does it mean?

I wanted to understand what one means by the lock_free property of atomic variables in C++11. I googled around and saw the other relevant questions on this forum, but I still have only a partial understanding. I would appreciate it if someone could explain it end-to-end and in a simple way.

It's probably easiest to start by talking about what would happen if it was not lock-free.
The most obvious way to handle most atomic tasks is by locking. For example, to ensure that only one thread writes to a variable at a time, you can protect it with a mutex. Any code that's going to write to the variable needs to obtain the mutex before doing the write (and release it afterwards). Only one thread can own the mutex at a time, so as long as all the threads follow the protocol, no more than one can write at any given time.
If you're not careful, however, this can be open to deadlock. For example, let's assume you need to write to two different variables (each protected by a mutex) as an atomic operation, i.e., you need to ensure that when you write to one, you also write to the other. In such a case, if you aren't careful, you can cause a deadlock. For example, let's call the two mutexes A and B. Thread 1 obtains mutex A, then tries to get mutex B. At the same time, thread 2 obtains mutex B, and then tries to get mutex A. Since each is holding one mutex, neither can obtain both, and neither can progress toward its goal.
There are various strategies to avoid such deadlocks (e.g., all threads always try to obtain the mutexes in the same order, or, upon failure to obtain a mutex within a reasonable period of time, each thread releases the mutex it holds, waits some random amount of time, and then tries again).
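For illustration, here is a minimal sketch of both the deadlock-prone pattern and the C++11 std::lock fix; the function names are made up:

#include <mutex>

std::mutex A, B;
int x = 0, y = 0;

// Deadlock-prone: the two threads take the mutexes in opposite order.
void thread1_bad()
{
    std::lock_guard<std::mutex> a(A);
    std::lock_guard<std::mutex> b(B);   // may wait forever if thread 2 holds B
    ++x; ++y;
}
void thread2_bad()
{
    std::lock_guard<std::mutex> b(B);
    std::lock_guard<std::mutex> a(A);   // may wait forever if thread 1 holds A
    ++x; ++y;
}

// Safe: std::lock acquires both mutexes with a deadlock-avoidance
// algorithm; equivalently, both threads could simply agree to always
// lock A before B.
void thread_good()
{
    std::lock(A, B);
    std::lock_guard<std::mutex> a(A, std::adopt_lock);
    std::lock_guard<std::mutex> b(B, std::adopt_lock);
    ++x; ++y;
}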
With lock-free programming, however, we (obviously enough) don't use locks. This means that a deadlock like the one above simply cannot happen. When done properly, you can guarantee that all threads continuously progress toward their goal. Contrary to popular belief, it does not mean the code will necessarily run any faster than well-written code using locks. It does, however, mean that deadlocks like the one above (and some other types of problems, like livelocks and some types of race conditions) are eliminated.
Now, as to exactly how you do that: the answer is short and simple: it varies -- widely. In a lot of cases, you're looking at specific hardware support for doing certain specific operations atomically. Your code either uses those directly, or extends them to give higher level operations that are still atomic and lock free. It's even possible (though only rarely practical) to implement lock-free atomic operations without hardware support (but given its impracticality, I'll pass on trying to go into more detail about it, at least for now).

Jerry already mentioned common correctness problems with locks, i.e. they're hard to understand and program correctly.
Another danger with locks is that you lose determinism regarding your execution time: if a thread that has acquired a lock gets delayed (e.g. descheduled by the operating system, or "swapped out"), then it is possible that the entire program is delayed because it is waiting for the lock. By contrast, a lock-free algorithm is always guaranteed to make some progress, even if any number of threads are held up somewhere else.
By and large, lock-free programming is often slower (sometimes significantly so) than locked programming using non-atomic operations, because atomic operations cause a significant hit on caching and pipelining; however, it offers determinism and upper bounds on latency (at least overall latency of your process; as #J99 observed, individual threads may still be starved as long as enough other threads are making progress). Your program may get a lot slower, but it never locks up entirely and always makes some progress.
The very nature of hardware architectures allows certain small operations to be intrinsically atomic. In fact, this is a necessity for any hardware that supports multitasking and multithreading. At the very heart of any synchronisation primitive, such as a mutex, you need some sort of atomic instruction that guarantees correct locking behaviour.
So, with that in mind, we now know that it is possible for certain types like booleans and machine-sized integers to be loaded, stored and exchanged atomically. Thus when we wrap such a type into an std::atomic template, we can expect that the resulting data type will indeed offer load, store and exchange operations that do not use locks. By contrast, your library implementation is always allowed to implement an atomic Foo as an ordinary Foo guarded by a lock.
To test whether an atomic object is lock-free, you can use the is_lock_free member function. Additionally, there are ATOMIC_*_LOCK_FREE macros that tell you whether atomic primitive types potentially have a lock-free instantiation. If you are writing concurrent algorithms that you want to be lock-free, you should thus include an assertion that your atomic objects are indeed lock-free, or a static assertion that the corresponding macro has the value 2 (meaning that every object of the corresponding type is always lock-free).
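For example, here is a minimal sketch of both checks; Big is a made-up type used only to show a case that is typically not lock-free:

#include <atomic>
#include <cassert>

// Compile-time check: value 2 means std::atomic<int> is always lock-free.
static_assert(ATOMIC_INT_LOCK_FREE == 2, "atomic<int> must be lock-free");

struct Big { char data[64]; };   // hypothetical large trivially copyable type

int main()
{
    std::atomic<int> i{0};
    std::atomic<Big> b{Big{}};
    assert(i.is_lock_free());    // true on virtually every implementation
    // A 64-byte payload usually falls back to a lock-based implementation:
    bool big_is_lock_free = b.is_lock_free();
    (void)big_is_lock_free;      // typically false
}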

Lock-free is one of the non-blocking techniques. For an algorithm, it involves a global progress property: whenever a thread of the program is active, it can make a step forward in its action, either for itself or possibly on behalf of another thread.
Lock-free algorithms are supposed to behave better under heavy contention, where threads acting on shared resources may spend a lot of time waiting for their next active time slice. They are also almost mandatory in contexts where you can't lock, like interrupt handlers.
Implementations of lock-free algorithms almost always rely on Compare-and-Swap (some may use things like LL/SC) and on a strategy where the visible modification can be reduced to a single value change (usually a pointer), making it the linearization point, looping over this modification if the value has changed in the meantime. Most of the time, these algorithms try to complete the jobs of other threads when possible. A good example is the lock-free queue of Michael & Scott (http://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html).
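As a concrete sketch of that pattern, here is the push operation of a Treiber-style lock-free stack (a simpler structure than the Michael & Scott queue); the single head pointer is the value whose successful CAS forms the linearization point:

#include <atomic>

template <typename T>
class LockFreeStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> head{nullptr};
public:
    void push(T value)
    {
        Node* n = new Node{value, head.load(std::memory_order_relaxed)};
        // If another thread changed head in the meantime,
        // compare_exchange_weak reloads it into n->next and we loop.
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed))
        { /* retry */ }
    }
};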
For lower-level instructions like Compare-and-Swap, it means that the implementation (probably the micro-code of the corresponding instruction) is wait-free (see http://www.diku.dk/OLD/undervisning/2005f/dat-os/skrifter/lockfree.pdf)
For the sake of completeness: a wait-free algorithm enforces progress for each thread: every operation is guaranteed to terminate in a finite number of steps.

Related

Anything in std::atomic is wait-free?

If T is a C++ fundamental type, and if std::atomic<T>::is_lock_free() returns true, then is there anything in std::atomic<T> that is wait-free (not just lock-free)? For example: load, store, fetch_add, fetch_sub, compare_exchange_weak, and compare_exchange_strong.
Can you also answer based on what is specified in the C++ Standard, and what is implemented in Clang and/or GCC (your version of choice).
My favorite definitions for lock-free and wait-free are taken from C++ Concurrency in Action (available for free). An algorithm is lock-free if it satisfies the first condition below, and it is wait-free if it satisfies both conditions below:
If one of the threads accessing the data structure is suspended by the scheduler midway through its operation, the other threads must still be able to complete their operations without waiting for the suspended thread.
Every thread accessing the data structure can complete its operation within a bounded number of steps, regardless of the behavior of other threads.
There exist universally accepted definitions of lock-freedom and wait-freedom, and the definitions provided in your question are consistent with those. I would assume that the C++ standard committee sticks to definitions that are universally accepted in the scientific community.
In general, publications on lock-free/wait-free algorithms assume that CPU instructions are wait-free. Instead, the arguments about progress guarantees focus on the software algorithm.
Based on this assumption I would argue that any std::atomic method that can be translated to a single atomic instruction for some architecture is wait-free on that specific architecture. Whether such a translation is possible sometimes depends on how the method is used though. Take for example fetch_or. On x86 this can be translated to lock or, but only if you do not use its return value, because this instruction does not provide the original value. If you use the return value, then the compiler will create a CAS-loop, which is lock-free, but not wait-free. (And the same goes for fetch_and/fetch_xor.)
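To illustrate, here is a minimal sketch of both usages; the comments describe the typical code generation on x86-64 with GCC, which may vary by compiler version:

#include <atomic>

std::atomic<unsigned> flags{0};

void set_bit_discard_old()
{
    flags.fetch_or(1);        // return value unused: typically a single `lock or`
}

unsigned set_bit_get_old()
{
    return flags.fetch_or(1); // return value used: typically a CAS loop
                              // (`lock cmpxchg`), since `lock or` does not
                              // provide the original value
}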
So which methods are actually wait-free depends not only on the compiler, but mostly on the target architecture.
Whether it is technically correct to assume that a single instruction is actually wait-free is a rather philosophical question, IMHO. True, it might not be guaranteed that an instruction finishes execution within a bounded number of "steps" (whatever such a step might be), but the machine instruction is still the smallest unit on the lowest level that we can see and control. Actually, if you cannot assume that a single instruction is wait-free, then strictly speaking it is not possible to run any real-time code on that architecture, because real-time also requires strict bounds on time/the number of steps.
This is what the C++17 standard states in [intro.progress]:
Executions of atomic functions that are either defined to be lock-free (32.8) or indicated as lock-free (32.5) are lock-free executions.
If there is only one thread that is not blocked (3.6) in a standard library function, a lock-free execution in that thread shall complete. [ Note: Concurrently executing threads may prevent progress of a lock-free execution. For example, this situation can occur with load-locked store-conditional implementations. This property is sometimes termed obstruction-free. — end note ]
When one or more lock-free executions run concurrently, at least one should complete. [ Note: It is difficult for some implementations to provide absolute guarantees to this effect, since repeated and particularly inopportune interference from other threads may prevent forward progress, e.g., by repeatedly stealing a cache line for unrelated purposes between load-locked and store-conditional instructions. Implementations should ensure that such effects cannot indefinitely delay progress under expected operating conditions, and that such anomalies can therefore safely be ignored by programmers. Outside this document, this property is sometimes termed lock-free. — end note ]
The other answer correctly pointed out that my original answer was a bit imprecise, since there exist two stronger subtypes of wait-freedom.
wait-free - A method is wait-free if it guarantees that every call finishes its execution in a finite number of steps; i.e., it need not be possible to determine an upper bound, but the number of steps must still be guaranteed to be finite.
wait-free bounded - A method is wait-free bounded if it guarantees that every call finishes its execution in a bounded number of steps, where this bound may depend on the number of threads.
wait-free population oblivious - A method is wait-free population oblivious if it guarantees that every call finishes its execution in a bounded number of steps, and this bound does not depend on the number of threads.
So strictly speaking, the definition in the question is consistent with the definition of wait-free bounded.
In practice, most wait-free algorithms are actually wait-free bounded or even wait-free population oblivious, i.e., it is possible to determine an upper bound on the number of steps.
Since there are many definitions of wait-freedom1 and people choose different ones, I think that a precise definition is paramount, and a distinction between its specializations is necessary and useful.
These are the universally accepted definitions of wait-freedom and its specializations:
wait-free:
All threads will make progress in a finite number of steps.
wait-free bounded:
All threads will make progress in a bounded number of steps, which may depend on the number of threads.
wait-free population-oblivious3:
All threads will make progress in a fixed number of steps, that does not depend on the number of threads.
Overall, the C++ standard makes no distinction between lock-free and wait-free (see this other answer). It always gives guarantees no stronger than lock-free.
When std::atomic<T>::is_lock_free() returns true, instead of mutexes the implementation utilizes atomic instructions possibly with CAS loops or LL/SC loops.
Atomic instructions are wait-free. CAS and LL/SC loops are lock-free.
How a method is implemented depends on many factors, including its usage, the compiler, the target architecture and its version. For example:
As noted in one answer, on x86 GCC, fetch_add() for std::atomic<double> uses a CAS loop (lock cmpxchg), while for std::atomic<int> it uses lock add or lock xadd (a sketch of such a CAS loop follows this list).
As noted in another answer, on architectures featuring LL/SC instructions, fetch_add() uses an LL/SC loop if no better instructions are available. For example, this is not the case on ARM versions 8.1 and above, where ldaddal is used for non-relaxed std::atomic<int> and ldadd is used if relaxed.
As stated in this other answer, on x86 gcc fetch_or() uses lock or if the return value is not used, otherwise it uses a CAS loop (lock cmpxchg).
As explained in this answer of mine:
The store() method and lock add, lock xadd, lock or instructions are wait-free population-oblivious, while their "algorithm", that is the work performed by the hardware to lock the cache line, is wait-free bounded.
The load() method is always wait-free population-oblivious.
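To make the CAS-loop shape concrete, here is a sketch of how fetch_add for std::atomic<double> can be expressed via compare_exchange_weak; this is essentially the loop the compiler emits on x86 under C++11/17 (since C++20, std::atomic<double> provides fetch_add directly):

#include <atomic>

double fetch_add_double(std::atomic<double>& a, double delta)
{
    double old = a.load(std::memory_order_relaxed);
    // Lock-free but not wait-free: under contention this loop may retry.
    // On failure, compare_exchange_weak reloads the current value into `old`.
    while (!a.compare_exchange_weak(old, old + delta))
    { /* retry */ }
    return old;
}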
1 For example:
all threads will make progress in a finite number of steps (source)
all threads will make progress in a bounded number of steps2
per "step" that they all execute, all threads will make forward progress without any starvation (source)
2 It is not clear whether the bound is constant, or it may depend on the number of threads.
3 A strange name and not good for an acronym, so maybe another one should be chosen.

Fast way : Synchronization with mutex and use of Atomic Boolean in C++ [duplicate]

The main reason for using atomics over mutexes is that mutexes are expensive, but with the default memory model for atomics being memory_order_seq_cst, isn't this just as expensive?
Question: Can a concurrent program using locks be as fast as a concurrent lock-free program?
If so, it may not be worth the effort unless I want to use memory_order_acq_rel for atomics.
Edit:
I may be missing something, but lock-based can't be faster than lock-free, because each lock will have to be a full memory barrier too, whereas with lock-free it's possible to use techniques that are less restrictive than memory barriers.
So back to my question: is lock-free any faster than lock-based in the new C++11 standard with the default memory model?
Is "lock-free >= lock-based when measured in performance" true? Let's assume 2 hardware threads.
Edit 2:
My question is not about progress guarantees, and maybe I'm using "lock-free" out of context.
Basically, when you have two threads with shared memory, and the only guarantee you need is that if one thread is writing then the other thread can't read or write, my assumption is that a simple atomic compare_and_swap operation would be much faster than locking a mutex.
Because if one thread never even touches the shared memory, you will end up locking and unlocking over and over for no reason, but with atomic operations you only spend one CPU cycle each time.
Regarding the comments: a spin-lock and a mutex-lock behave very differently when there is very little contention.
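For illustration, here is a minimal sketch of the kind of usage the question describes: claiming exclusive access with a single atomic read-modify-write instead of a mutex. The names busy, try_enter and leave are made up:

#include <atomic>

std::atomic<bool> busy{false};

bool try_enter()
{
    bool expected = false;
    // A single atomic RMW: succeeds for exactly one thread at a time.
    return busy.compare_exchange_strong(expected, true,
                                        std::memory_order_acquire);
}

void leave()
{
    busy.store(false, std::memory_order_release);
}

Note that looping on try_enter turns this into a spinlock, with exactly the contention caveats discussed in the answers below.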
Lock-free programming is about progress guarantees: from strongest to weakest, those are wait-free, lock-free, obstruction-free, and blocking.
A guarantee is expensive and comes at a price: the more guarantees you want, the more you pay. Generally, a blocking algorithm or data structure (with a mutex, say) has the greatest liberties, and thus is potentially the fastest. A wait-free algorithm at the other extreme must use atomic operations at every step, which may be much slower.
Obtaining a lock is actually rather cheap, so you should never worry about that without a deep understanding of the subject. Moreover, blocking algorithms with mutexes are much easier to read, write and reason about. By contrast, even the simplest lock-free data structures are the result of long, focused research, each of them worth one or more PhDs.
In a nutshell, lock-free or wait-free algorithms trade worst-case latency for mean latency and throughput. Everything is slower on average, but nothing is ever very slow. This is a very special characteristic that is only useful in very specific situations (like real-time systems).
A lock tends to require more operations than a simple atomic operation does. In the simplest cases, memory_order_seq_cst will be about twice as fast as locking, because locking tends to require, at minimum, two atomic operations in its implementation (one to lock, one to unlock). In many cases, it takes even more than that. However, once you start leveraging the memory orders, it can be much faster because you are willing to accept less synchronization.
Also, you'll often hear that "locking algorithms are always as fast as lock-free algorithms." This is somewhat true. The basic idea is that if the fastest algorithm happens to be lock-free, then the fastest algorithm without the lock-free guarantee is ALSO the same algorithm! However, if the fastest algorithm requires locks, then those demanding lock-free guarantees have to go find a slower algorithm.
In general, you will see lock-free algorithms in a few low-level algorithms, where the performance of leveraging specialized opcodes helps. In almost all other code, locking gives more than satisfactory performance, and it is much easier to read.
Question: Can a concurrent program using locks be as fast as a concurrent lock-free program?
It can be faster: a lock-free algorithm must keep the global state consistent at all times, and must do calculations without knowing whether they will be productive, as the state might have changed by the time the calculation is done, making it irrelevant and wasting CPU cycles.
The lock-free strategy makes the serialization happen at the end of the process, when the calculation is done. In a pathological case, many threads can each make an effort but only one effort will be productive; the others will retry.
Lock-free can lead to starvation of some threads, whatever their priority is, and there is no way to avoid that. (Although it's unlikely for a thread to starve retrying for very long unless there is crazy contention.)
On the other hand, "serialized calculation and series of side effects" based (aka lock-based) algorithms will not start before they know they will not be prevented by other actors from operating on that specific locked resource (the guarantee is provided by the use of a mutex). Note that they might be prevented from finishing by the need to access another resource, if multiple locks are taken, leading to possible deadlock when multiple locks are needed in a badly designed program.
Note that this deadlock issue isn't in the scope of lock-free code, which can't even act on multiple entities: it usually can't do an atomic commit based on two unrelated objects (1).
So the absence of any chance of deadlock in lock-free code is a sign of the weakness of lock-free code: not being able to deadlock is a limit of your tool. A system that can only hold one lock at a time also wouldn't be able to deadlock.
The scope of lock-free algorithms is minuscule compared to the scope of lock-based algorithms. For a lot of problems, lock-free doesn't even make sense.
A lock-based algorithm is polite: the threads have to wait in line before doing what they need to do, which is maximally efficient in terms of computation steps by each thread. But it's inefficient to have to queue threads in a wait list: they often can't use the end of their time slice, like someone trying to do serious work while being interrupted by the phone all the time: their concentration is gone and they can never reach maximum efficiency because their work time is cut into small pieces.
(1) You would at least need to be able to do a double CAS for that, that is, an operation atomic on two arbitrary addresses (not a double-word CAS, which is just a CAS on more bits and which can trivially be implemented up to the natural CPU memory access arbitration unit, the cache line).

C++ Atomic/Mutex What way to follow?

I was wondering what the better choice is: assume there is a trivially copyable object, let's say a queue data structure, that is used by several threads to push/pop data. The object provides only the methods push/pop, which can't be accessed by more than one thread at the same time. Obviously, if push is called, pop can't be called either.
Would you suggest wrapping the object in an atomic type (if possible), or rather using mutexes?
Regards!
Atomics are a hardware thing, whereas a mutex is an OS thing. A mutex will end up suspending the task, even though in some cases the mutex will behave as a spinlock for a short period of time, aka "optimistic spin"; see https://lore.kernel.org/all/56C2673F.6070202#hpe.com/T/
So, if you have small operations like incrementing a variable, aka "atomic" operations, without waiting for other things which might take longer, then atomics are for you.
If you want to (indefinitely) wait for some things to happen in other threads, polling for results via atomics, aka a spinlock, might be a waste of CPU cycles and therefore less cooperative, so it's better to use a mutex/condition variable, which will suspend the task at the price of context-switch latency.
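A minimal sketch of that difference, with made-up names: the polling version burns CPU while waiting, whereas the condition-variable version lets the OS suspend the waiting thread:

#include <atomic>
#include <condition_variable>
#include <mutex>

std::atomic<bool> ready{false};

void wait_by_polling()
{
    while (!ready.load(std::memory_order_acquire))
        ;                        // spin: wastes CPU cycles while waiting
}

std::mutex m;
std::condition_variable cv;
bool ready2 = false;

void wait_by_blocking()
{
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return ready2; });  // thread is suspended by the OS
}

void signal_ready()
{
    { std::lock_guard<std::mutex> lk(m); ready2 = true; }
    cv.notify_one();
}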
Atomics are preferable for those kinds of cases. An atomic is a kind of operation supported directly by the CPU, whereas the other kinds of thread control tend to be implemented by the OS or by other measures, and they incur more overhead.
EDIT: A quick search turns up this question, which has more info and is basically the same kind of question: Which is more efficient, basic mutex lock or atomic integer?
EDIT 2: And a more detailed article here http://www.informit.com/articles/article.aspx?p=1832575

What is lock-free multithreaded programming?

I have seen people/articles/SO posts who say they have designed their own "lock-free" container for multithreaded usage. Assuming they haven't used a performance-hitting modulus trick (i.e., each thread can only insert based upon some modulo), how can data structures be multi-threaded but also lock-free?
This question is intended towards C and C++.
The key in lock-free programming is to use hardware-intrinsic atomic operations.
As a matter of fact, even locks themselves must use those atomic operations!
But the difference between locked and lock-free programming is that a lock-free program can never be stalled entirely by any single thread. If, in a locking program, one thread acquires a lock and then gets suspended indefinitely, the entire program is blocked and cannot make progress. By contrast, a lock-free program can make progress even if individual threads are suspended indefinitely.
Here's a simple example: A concurrent counter increment. We present two versions which are both "thread-safe", i.e. which can be called multiple times concurrently. First the locked version:
#include <mutex>

int counter = 0;
std::mutex counter_mutex;

void increment_with_lock()
{
    // The mutex is held for the duration of the increment.
    std::lock_guard<std::mutex> _(counter_mutex);
    ++counter;
}
Now the lock-free version:
#include <atomic>

std::atomic<int> counter(0);

void increment_lockfree()
{
    // A single atomic read-modify-write; no mutex needed.
    ++counter;
}
Now imagine hundreds of threads all calling the increment_* functions concurrently. In the locked version, no thread can make progress until the lock-holding thread unlocks the mutex. In the lock-free version, by contrast, all threads can make progress. If a thread is held up, it just won't do its share of the work, but everyone else gets to get on with their work.
It is worth noting that, in general, lock-free programming trades throughput and mean latency for predictable latency. That is, a lock-free program will usually get less done than a corresponding locking program if there is not too much contention (since atomic operations are slow and affect a lot of the rest of the system), but it guarantees never to produce unpredictably large latencies.
For locks, the idea is that you acquire a lock and then do your work knowing that nobody else can interfere, then release the lock.
For "lock-free", the idea is that you do your work somewhere else and then attempt to atomically commit this work to "visible state", and retry if you fail.
The problems with "lock-free" are that:
it's hard to design a lock-free algorithm for anything that isn't trivial. This is because there are only so many ways to do the "atomically commit" part (often relying on an atomic "compare and swap" that replaces a pointer with a different pointer).
if there's contention, it performs worse than locks because you're repeatedly doing work that gets discarded/retried
it's virtually impossible to design a lock-free algorithm that is both correct and "fair". This means that (under contention) some tasks can be lucky (and repeatedly commit their work and make progress) and some can be very unlucky (and repeatedly fail and retry).
The combination of these things means that it's only good for relatively simple things under low contention.
Researchers have designed things like lock-free linked lists (and FIFO/FILO queues) and some lock-free trees. I don't think there's anything more complex than those. As for how these things work: because it's hard, it's complicated. The sanest approach is to determine what type of data structure you're interested in, then search the web for relevant research into lock-free algorithms for that data structure.
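As a sketch of the "do your work somewhere else, then atomically commit" pattern described above, here is an update that prepares a private copy and publishes it with a single pointer CAS; Data and update are made-up names, and safe memory reclamation is deliberately omitted:

#include <atomic>

struct Data { int payload; };

std::atomic<Data*> current{new Data{0}};

void update()
{
    Data* old = current.load();
    for (;;) {
        Data* fresh = new Data{old->payload + 1};  // work on a private copy
        if (current.compare_exchange_weak(old, fresh))
            break;        // committed: this CAS is the visible state change
        delete fresh;     // commit failed: discard the work and retry
                          // (`old` has been reloaded by the failed CAS)
    }
    // Reclaiming the replaced object safely requires a scheme such as
    // hazard pointers or RCU, which is beyond the scope of this sketch.
}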
Also note that there is something called "block free" (a property more commonly termed "wait-free"), which is like lock-free except that you know you can always commit the work and never need to retry. It's even harder to design a block-free algorithm, but contention doesn't matter, so the other two problems with lock-free disappear. Note: the "concurrent counter" example in Kerrek SB's answer is not just lock-free, but is actually block-free.
The idea of "lock free" is not really about having no locks at all; the idea is to minimize the number of locks and/or critical sections by using techniques that allow us to avoid locks for most operations.
It can be achieved using optimistic design or transactional memory, where you do not lock the data for all operations, but only at certain points (when committing the transaction in transactional memory, or when you need to roll back in an optimistic design).
Other alternatives are based on atomic implementations of some commands, such as CAS (Compare And Swap), which even allows us to solve the consensus problem given an implementation of it. By doing the swap on references (while no thread is working on the common data), the CAS mechanism allows us to easily implement a lock-free optimistic design (swapping to the new data if and only if no one has changed it already, and doing this atomically).
However, to implement the underlying mechanism of one of these, some locking will most likely be used, but the amount of time the data is locked is (supposedly) kept to a minimum, if these techniques are used correctly.
The new C and C++ standards (C11 and C++11) introduced threads, along with atomic data types and operations on data shared between threads. An atomic operation gives guarantees for operations that run into a race between two threads. Once a thread returns from such an operation, it can be sure that the operation has gone through in its entirety.
Typical processor support for such atomic operations exists on modern processors for compare and swap (CAS) or atomic increments.
In addition to being atomic, a data type can have the "lock-free" property. This should perhaps have been coined "stateless", since this property implies that an operation on such a type will never leave the object in an intermediate state, even when it is interrupted by an interrupt handler or when a read from another thread falls in the middle of an update.
Several atomic types may (or may not) be lock-free; there are macros to test for that property. There is always one type that is guaranteed to be lock-free, namely atomic_flag.
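Since atomic_flag is the one type guaranteed to be lock-free, it can serve as a portable building block for a spinlock. A minimal sketch (C++11):

#include <atomic>

class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()
    {
        // test_and_set atomically sets the flag and returns its old value;
        // spin until we are the thread that flipped it from clear to set.
        while (flag.test_and_set(std::memory_order_acquire))
            ;   // busy-wait
    }
    void unlock()
    {
        flag.clear(std::memory_order_release);
    }
};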