Herlihy and Shavit's book (The Art of Multiprocessor Programming) solves memory reclamation using Java's AtomicStampedReference<T>.
To write one in C++ for x86_64, I imagine, requires at least a 12-byte swap operation: 8 bytes for a 64-bit pointer and 4 for the int.
Is there x86 hardware support for this and if not, any pointers on how to do wait-free memory reclamation without it?
Yes, there is hardware support, though I don't know whether it is exposed by C++ libraries. Anyway, if you don't mind doing some low-level, unportable assembly-language trickery, look up the CMPXCHG16B instruction in the Intel manuals.
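To sketch how that can look from C++ today (my own illustration, not from the book; with GCC/Clang you need -mcx16 or a recent libatomic for this to be lock-free), a 16-byte trivially copyable struct inside std::atomic can compile down to lock cmpxchg16b:

#include <atomic>
#include <cstdint>

template <typename T>
struct stamped_ref {            // 8-byte pointer + 8-byte stamp = 16 bytes
    T* ptr;
    std::uint64_t stamp;
};

template <typename T>
bool bump(std::atomic<stamped_ref<T>>& a, T* desired)
{
    stamped_ref<T> old_v = a.load();
    stamped_ref<T> new_v{desired, old_v.stamp + 1};  // fresh stamp defeats ABA
    return a.compare_exchange_strong(old_v, new_v);
}

The names stamped_ref and bump are hypothetical; the point is only that the whole pointer+stamp pair is exchanged in one atomic operation.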
Windows gives you a bunch of Interlocked functions that are atomic and can probably be used to do what you want. Similar functions exist for other platforms, and I believe Boost has an interlocked library as well.
Your question isn't super clear and I don't have a copy of Herlihy and Shavit lying around. Perhaps if you elaborated or gave pseudocode outlining what you want to do, we could give you a more specific answer.
OK, hopefully this helps; I have the book.
For others who may provide answers, the point is to implement this class:
template <typename T>
class AtomicReference {
public:
    void set(T *ref, int stamp) { ... }
    T *get(int *stamp) { ... }

private:
    T *_ref;
    int _stamp;
};
in a lock-free way, so that:
set() updates the reference and the stamp atomically.
get() returns the reference and sets *stamp to the stamp corresponding to the reference.
JDonner please, correct me if I am wrong.
Now my answer: I don't think you can do it without a lock somewhere (a lock can be as simple as while (test_and_set() != ...)), so there is no lock-free algorithm for this. Otherwise it would mean that it is possible to build an N-byte register in a lock-free way for any N.
If you look at Pragma 9.8.1 in the book, the AtomicMarkableReference is the same thing with a single bit instead of an integer stamp. The authors suggest "stealing" a bit from a pointer so that the mark and the pointer can be extracted from a single word (almost quoted). This obviously means that they intend to use a single atomic register to do it.
However, there may be a way to build wait-free memory reclamation without it. I don't know.
Yes, x64 supports this; you need to use CMPXCHG16B.
You can save a bit of memory by relying on the fact that the pointer will use fewer than 64 bits. First, define a compare-and-set function (this asm works in GCC and ICC):
#include <cstdint>

inline bool CAS_(volatile uint64_t* mem, uint64_t old_val, uint64_t new_val)
{
    uint32_t old_low = old_val, old_high = old_val >> 32;
    uint32_t new_low = new_val, new_high = new_val >> 32;
    unsigned char res;
    asm volatile("lock; cmpxchg8b %0; setz %1"
                 : "+m" (*mem),     // the 64-bit location operated on
                   "=q" (res),      // ZF: 1 on success
                   "+a" (old_low),  // expected value in edx:eax;
                   "+d" (old_high)  //   receives the current value on failure
                 : "b" (new_low),   // replacement value in ecx:ebx
                   "c" (new_high)
                 : "cc", "memory");
    return res;
}
You'll then need to build a tagged-pointer type. I'm assuming a 40-bit pointer and a cache-line width of 128 bytes (like Nehalem). Aligning to the cache line will give enormous speed improvements by reducing false sharing, contention, etc.; this has the obvious trade-off of using a lot more memory in some situations.
template <typename pointer_type, typename tag_type,
          int PtrAlign = 7, int PtrWidth = 40>
struct tagged_pointer
{
    typedef std::uint64_t raw_value_type;

    // Low bits hold the (right-shifted) pointer, high bits hold the tag.
    static const raw_value_type PtrMask =
        (raw_value_type(1) << (PtrWidth - PtrAlign)) - 1;
    static const raw_value_type TagMask = ~PtrMask;

    raw_value_type raw_m_;

    tagged_pointer () : raw_m_(0) {}
    tagged_pointer (pointer_type ptr) { this->pack(ptr, 0); }
    tagged_pointer (pointer_type ptr, tag_type tag) { this->pack(ptr, tag); }

    void pack (pointer_type ptr, tag_type tag)
    {
        this->raw_m_ = 0;
        this->raw_m_ |= (reinterpret_cast<raw_value_type>(ptr) >> PtrAlign) & PtrMask;
        this->raw_m_ |= (raw_value_type(tag) << (PtrWidth - PtrAlign)) & TagMask;
    }

    pointer_type ptr () const
    {
        raw_value_type p = (this->raw_m_ & PtrMask) << PtrAlign;
        return reinterpret_cast<pointer_type>(p);
    }

    tag_type tag () const
    {
        raw_value_type t = (this->raw_m_ & TagMask) >> (PtrWidth - PtrAlign);
        return static_cast<tag_type>(t);
    }
};
I haven't had a chance to debug this code, so you'll need to do that, but this is the general idea.
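For what it's worth, a hypothetical usage sketch (names and setup are mine), assuming nodes are 128-byte aligned as the default PtrAlign=7 requires:

struct node { int value; };

volatile uint64_t slot = 0;   // raw storage for a tagged pointer

bool replace(node* expected, node* desired)
{
    tagged_pointer<node*, unsigned> cur;
    cur.raw_m_ = slot;                      // snapshot current pointer+tag
    if (cur.ptr() != expected)
        return false;
    tagged_pointer<node*, unsigned> next(desired, cur.tag() + 1);  // bump tag to defeat ABA
    return CAS_(&slot, cur.raw_m_, next.raw_m_);
}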
Note that on the x86_64 architecture, gcc can enable 128-bit CAS. It is enabled with the -mcx16 gcc option.
int main()
{
    __int128_t x = 0;
    // with -mcx16 this can compile to a single lock cmpxchg16b
    __sync_bool_compare_and_swap(&x, 0, 10);
    return 0;
}
Compile with:
gcc -mcx16 file.c
The cmpxchg16b operation provides the expected operation, but beware that some older x86-64 processors don't have this instruction.
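If you need to handle those older chips at runtime, here is a small detection sketch using GCC/Clang's <cpuid.h> helper (CPUID leaf 1 reports CMPXCHG16B in ECX bit 13):

#include <cpuid.h>   // GCC/Clang helper for the CPUID instruction

bool has_cmpxchg16b()
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx & bit_CMPXCHG16B) != 0;   // CPUID.1:ECX bit 13
}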
You then just need to build an entity with the counter and the pointer, plus the inline asm code. I've written a blog post on the subject here: Implementing Generic Double Word Compare And Swap
Nevertheless, you don't need this operation if you just want to prevent early-free and ABA issues. Hazard pointers are simpler and don't require specific asm code (as long as you use C++11 atomic values). I've got a repo on Bitbucket with experimental implementations of various lock-free algorithms: Lock Free Experiment (beware: all these implementations are toys for experimentation, not reliable, tested code for production).
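To give a flavour of the hazard-pointer idea with nothing but C++11 atomics, here is a heavily reduced sketch of mine (not from the repo; a real implementation needs one slot per thread, retire lists, and a scan step before freeing):

#include <atomic>

struct Node { int data; Node* next; };

std::atomic<Node*> head{nullptr};
std::atomic<Node*> hazard_slot{nullptr};   // one slot here; real code has one per thread

Node* protect()
{
    Node* p;
    do {
        p = head.load(std::memory_order_acquire);
        hazard_slot.store(p, std::memory_order_seq_cst);   // publish before re-check
    } while (p != head.load(std::memory_order_acquire));   // re-validate: p is now pinned
    return p;
}
// A reclaimer may free a retired node only if it appears in no hazard slot.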
Related
The code base I work in is quite old. While we compile nearly everything with C++11, much of the code was written in C many years ago. When developing new classes in old areas, I always find myself having to choose between matching the old methodologies or going with a more modern approach.
In most cases, I prefer sticking with more modern techniques when possible. However, one common old practice I often see, and which I have a hard time arguing against, is bitfields. We pass a lot of messages here, and many times they are full of single-bit values. Take the example below:
class NewStructure
{
public:
    bool getValue1() const
    {
        return value1;
    }

    void setValue1(bool input)
    {
        value1 = input;
    }

private:
    bool value1;
    bool value2;
    bool value3;
    bool value4;
    bool value5;
    bool value6;
    bool value7;
    bool value8;
};

struct OldStructure
{
    bool getValue1() const
    {
        return value1;
    }

    void setValue1(bool input)
    {
        value1 = input;
    }

    unsigned char value1 : 1;
    unsigned char value2 : 1;
    unsigned char value3 : 1;
    unsigned char value4 : 1;
    unsigned char value5 : 1;
    unsigned char value6 : 1;
    unsigned char value7 : 1;
    unsigned char value8 : 1;
};
In this case the sizes are 8 bytes for the New Structure and 1 for the old.
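(Something like the following checks this at compile time; the exact values are implementation-specific, but these are the typical ones:)

static_assert(sizeof(NewStructure) == 8, "eight bools, typically one byte each");
static_assert(sizeof(OldStructure) == 1, "eight 1-bit fields share one byte");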
I added a "getter" and "setter" to illustrate the point that, from a user perspective, they can be identical. I realize that perhaps you could make the case of readability for the next developer, but other than that, is there a reason to avoid bit fields? I know that packed fields take a performance hit, but because these are all characters, padding rules are still in place.
There are several things to consider when using bitfields. Those are (the order of importance depends on the situation):
Performance
Bitfield operations incur a performance penalty when set or read (as compared to direct types). A simple codegen example shows the extra instructions emitted: https://gcc.godbolt.org/z/DpcErN However, bitfields provide more compact data, which is more cache-friendly, and that could completely outweigh any drawbacks of the additional operations. The only way to understand the real performance impact is by benchmarking the actual application in the real use case.
ABI Interoperability
Endianness of bit fields is implementation-defined, so the layout of the same struct produced by two compilers can differ.
Usability
There is no reference binding to a bitfield, nor can you take its address. This could affect code and make it less clear.
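For instance (a minimal illustration of the restriction):

struct S { unsigned x : 3; };

S s;
// unsigned* p = &s.x;  // error: cannot take the address of a bit-field
// unsigned& r = s.x;   // error: cannot bind a non-const reference to a bit-field
unsigned v = s.x;       // fine: read the value out into an ordinary object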
For you, as the programmer, there's not much difference. But the machine code to access a whole byte is much simpler/shorter than to access an individual bit, so using bitfields bulks up the generated code.
In a pseudo-assembly-language, your setter might turn into something like:
ldb input1,b ; get the new value into accumulator b
movb b,value1 ; put it into the variable
rts ; return from subroutine
But it's not so easy for bitfields:
ldb input1,b ; get the new value into accumulator b
movb bitfields,a ; get current bitfield values into accumulator a
cmpb b,#0 ; See what to do.
brz clearvalue1: ; If it's zero, go to clearing the bit
orb #$80,a ; set the bit representing value1.
bra resume: ; skip the clearing code.
clearvalue1:
andb #$7f,a ; clear the bit representing value1
resume:
movb a,bitfields ; put the value back
rts ; return
And it has to do that for each of your 8 members' setters, and something similar for the getters. It adds up. Additionally, even today's dumbest compilers would probably inline the full-byte setter code rather than actually make a subroutine call; for the bitfield setter, it may depend on whether you're optimizing for speed or space.
And you've only asked about booleans. If they were integer bitfields, then the compiler has to deal with loading, masking out the prior value, shifting the new value to align with its field, masking out unused bits, ORing the value into place, and then writing it back to memory.
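Roughly, each integer bit-field store expands to something like the following hand-written equivalent (assuming a field of width bits at bit offset shift within one storage word):

unsigned storage = 0;                 // the word holding all the fields

void set_field(unsigned v, unsigned shift, unsigned width)
{
    unsigned mask = (1u << width) - 1;
    unsigned word = storage;          // load
    word &= ~(mask << shift);         // mask out the prior value
    word |= (v & mask) << shift;      // shift and OR the new value into place
    storage = word;                   // write back
}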
So why would you use one vs. the other?
Bitfields are slower, but pack data more efficiently.
Non-bitfields are faster, and require less machine code to access.
As the developer, it's your judgement call. If you will be keeping many instances of the structure in memory at once, the memory savings may be worth it. If you aren't going to have many instances of that structure in memory at once, the compiled-code bloat offsets the memory savings and you're sacrificing speed.
#include <bitset>
#include <cstddef>
#include <type_traits>

template<typename enum_type, std::size_t n_bits>
class bit_flags {
    static_assert(std::is_enum<enum_type>{}, "enum_type must be an enumeration");
    std::bitset<n_bits> bits;
public:
    auto operator[](enum_type bit) { return bits[static_cast<std::size_t>(bit)]; }
    bit_flags& set(enum_type bit) { bits.set(static_cast<std::size_t>(bit)); return *this; }
    bit_flags& reset(enum_type bit) { bits.reset(static_cast<std::size_t>(bit)); return *this; }
    //go on with flip et al...
};

enum class v_flags { v1, v2, /*...*/ vN };
bit_flags<v_flags, static_cast<std::size_t>(v_flags::vN) + 1> my_flags;

my_flags.set(v_flags::v1);
my_flags[v_flags::v2] = true;
std::bitset is as efficient as bool bit fields. You can wrap it in a class to force using every bit by names defined in an enum. Now you have a small but scalable utility to use for multiple different sets of bool flags. C++17 makes it even more convenient:
template<auto last_flag, typename enum_type = decltype(last_flag)>
class bit_flags {
    std::bitset<static_cast<std::size_t>(last_flag) + 1> bits;
    //...
};

bit_flags<v_flags::vN> my_flags;
I've got a single writer which has to increment a variable at a fairly high frequency, and also one or more readers who access this variable at a lower frequency.
The write is triggered by an external interrupt.
Since I need to write at high speed, I don't want to use mutexes or other expensive locking mechanisms.
The approach I came up with was copying the value after writing to it. The reader can then compare the original with the copy; if they are equal, the variable's content is valid.
Here is my implementation in C++:
template<typename T>
class SafeValue
{
private:
    volatile T _value;
    volatile T _valueCheck;

public:
    void setValue(T newValue)
    {
        _value = newValue;
        _valueCheck = _value;
    }

    T getValue()
    {
        volatile T value;
        volatile T valueCheck;
        do {
            valueCheck = _valueCheck;
            value = _value;
        } while (value != valueCheck);
        return value;
    }
};
The idea behind this is to detect data races while reading and retry if they happen. However, I don't know if this will always work. I haven't found anything about this approach online, hence my question:
Is there any problem with my approach when used with a single writer and multiple readers?
I already know that high write frequencies may cause starvation of the readers. Are there more bad effects I have to be cautious of? Could it even be that this isn't thread-safe at all?
Edit 1:
My target system is an ARM Cortex-A15.
T should be able to be at least any primitive integral type.
Edit 2:
std::atomic is too slow on both the reader and the writer side. I benchmarked it on my system: writes are roughly 30 times slower and reads roughly 50 times slower, compared to unprotected primitive operations.
If this single variable is just an integer, pointer, or plain old value type, you can probably just use std::atomic.
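A minimal sketch of that (assuming the variable is a counter, as in the question): with a single writer, the increment doesn't even need an atomic read-modify-write; a plain load followed by a store is enough.

#include <atomic>
#include <cstdint>

std::atomic<std::uint32_t> counter{0};

void isr()  // single writer (the interrupt)
{
    // load + store instead of fetch_add: no lock-prefixed RMW needed
    counter.store(counter.load(std::memory_order_relaxed) + 1,
                  std::memory_order_relaxed);
}

std::uint32_t read_counter()  // any number of readers
{
    return counter.load(std::memory_order_relaxed);
}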
You should try using std::atomic first, but make sure that your compiler knows and understands your target architecture. Since you are targeting Cortex-A15 (ARMv7-A cpu), make sure to use -march=armv7-a or even -mcpu=cortex-a15.
The first shall generate the ldrexd instruction, which should be atomic according to the ARM docs:
Single-copy atomicity
In ARMv7, the single-copy atomic processor accesses are:
all byte accesses
all halfword accesses to halfword-aligned locations
all word accesses to word-aligned locations
memory accesses caused by LDREXD and STREXD instructions to doubleword-aligned locations.
The latter shall generate the ldrd instruction, which should be atomic on targets supporting the Large Physical Address Extension:
In an implementation that includes the Large Physical Address Extension, LDRD and STRD accesses to 64-bit aligned locations are 64-bit single-copy atomic as seen by translation table walks and accesses to translation tables.
--- Note ---
The Large Physical Address Extension adds this requirement to avoid the need for complex measures to avoid atomicity issues when changing translation table entries, without creating a requirement that all locations in the memory system are 64-bit single-copy atomic.
You can also check how Linux kernel implements those:
#ifdef CONFIG_ARM_LPAE
static inline long long atomic64_read(const atomic64_t *v)
{
long long result;
__asm__ __volatile__("# atomic64_read\n"
" ldrd %0, %H0, [%1]"
: "=&r" (result)
: "r" (&v->counter), "Qo" (v->counter)
);
return result;
}
#else
static inline long long atomic64_read(const atomic64_t *v)
{
long long result;
__asm__ __volatile__("# atomic64_read\n"
" ldrexd %0, %H0, [%1]"
: "=&r" (result)
: "r" (&v->counter), "Qo" (v->counter)
);
return result;
}
#endif
There's no way anyone can know. You would have to see if either your compiler documents any multi-threaded semantics that would guarantee that this will work or look at the generated assembler code and convince yourself that it will work. Be warned that in the latter case, it is always possible that a later version of the compiler, or different optimizations options or a newer CPU, might break the code.
I'd suggest testing std::atomic with the appropriate memory_order. If for some reason that's too slow, use inline assembly.
Another option is to have a buffer of non-atomic values the publisher produces and an atomic pointer to the latest.
#include <atomic>
#include <utility>
template<class T>
class PublisherValue {
static auto constexpr N = 32;
T values_[N];
std::atomic<T*> current_{values_};
public:
PublisherValue() = default;
PublisherValue(PublisherValue const&) = delete;
PublisherValue& operator=(PublisherValue const&) = delete;
// Single writer thread only.
template<class U>
void store(U&& value) {
T* p = current_.load(std::memory_order_relaxed);
if(++p == values_ + N)
p = values_;
*p = std::forward<U>(value);
current_.store(p, std::memory_order_release); // (1)
}
// Multiple readers. Make a copy to avoid referring the value for too long.
T load() const {
return *current_.load(std::memory_order_consume); // Sync with (1).
}
};
This is wait-free, but there is a small chance that a reader might be de-scheduled while copying the value and hence read the oldest value while it is being partially overwritten. Making N bigger reduces this risk.
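Usage would look something like this (hypothetical driver code):

PublisherValue<int> pv;

// writer thread / interrupt handler:
pv.store(42);

// any reader thread:
int v = pv.load();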
I need to use an _Interlocked*** function on char or short, but the functions take a long pointer as input. There seems to be an _InterlockedExchange8 function, but I can't find any documentation on it; it looks like an undocumented feature. The compiler also wasn't able to find an _InterlockedAdd8 function.
I would appreciate any information on those functions, recommendations to use or not use them, and other solutions as well.
update 1
I'll try to simplify the question.
How can I make this work?
struct X
{
    char data;
};

X atomic_exchange(X another)
{
    return _InterlockedExchange( ??? );
}
I see two possible solutions:
Use _InterlockedExchange8
Cast another to long, do exchange and cast result back to X
The first one is obviously a bad solution.
The second one looks better, but how do I implement it?
update 2
What do you think about something like this?
template <typename T, typename U>
class padded_variable
{
public:
    padded_variable(T v): var(v) {}
    padded_variable(U v): var(*static_cast<T*>(static_cast<void*>(&v))) {}

    U& cast()
    {
        return *static_cast<U*>(static_cast<void*>(&var));
    }

    T& get()
    {
        return var;
    }

private:
    T var;
    char padding[sizeof(U) - sizeof(T)];
};
struct X
{
    char data;
};

template <typename T, int S = sizeof(T)> class var;

template <typename T> class var<T, 1>
{
public:
    var(): data(T()) {}

    T atomic_exchange(T another)
    {
        padded_variable<T, long> xch(another);
        padded_variable<T, long> res(_InterlockedExchange(&data.cast(), xch.cast()));
        return res.get();
    }

private:
    padded_variable<T, long> data;
};
Thanks.
It's pretty easy to make 8-bit and 16-bit interlocked functions, but the reason they're not included in the WinAPI is IA64 portability. If you want to support Win64, the assembler cannot be inline, as MSVC no longer supports inline asm there. As external function units built with MASM64, they will not be as fast as inline code or intrinsics, so you are wiser to investigate promoting algorithms to use 32-bit and 64-bit atomic operations instead.
Example interlocked API implementation: intrin.asm
Why do you want to use smaller data types? So you can fit a bunch of them in a small memory space? That's just going to lead to false sharing and cache line contention.
Whether you use locking or lockless algorithms, it's ideal to have your data in blocks of at least 128 bytes (or whatever the cache line size is on your CPU) that are only used by a single thread at a time.
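As a C++11 sketch of that advice (64 bytes is used here as a common line size; substitute your CPU's):

struct alignas(64) PerThreadCounter {
    long value;                      // alignas pads the struct to a full line
};

static_assert(sizeof(PerThreadCounter) == 64, "one cache line per counter");

PerThreadCounter counters[8];        // one per thread: no false sharing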
Well, you have to make do with the functions available. _InterlockedIncrement and _InterlockedCompareExchange are available in 16- and 32-bit variants (the latter in a 64-bit variant as well), and maybe a few other interlocked intrinsics are available in 16-bit versions as well, but InterlockedAdd doesn't seem to be, and there seem to be no byte-sized Interlocked intrinsics/functions at all.
So... you need to take a step back and figure out how to solve your problem without an InterlockedAdd8.
Why are you working with individual bytes in any case? Stick to int-sized objects unless you have a really good reason to use something smaller.
Creating a new answer because your edit changed things a bit:
Use _InterlockedExchange8
Cast another to long, do exchange and cast result back to X
The first simply won't work. Even if the function existed, it would only let you atomically update a byte at a time, which means the object as a whole would be updated in a series of steps that wouldn't be atomic.
The second doesn't work either, unless X is a long-sized POD type (aligned on a sizeof(long) boundary and of the same size as a long).
In order to solve this problem you need to narrow down what types X might be. First, of course, is it guaranteed to be a POD type? If not, you have an entirely different problem, as you can't safely treat non-POD types as raw memory bytes.
Second, what sizes may X have? The Interlocked functions can handle 16, 32 and, depending on circumstances, maybe 64 or even 128 bit widths.
Does that cover all the cases you can encounter?
If not, you may have to abandon these atomic operations, and settle for plain old locks. Lock a Mutex to ensure that only one thread touches these objects at a time.
I am attempting to re-write a raytracer using Streaming SIMD Extensions. My original raytracer used inline assembly and movups instructions to load data into the xmm registers. I have read that compiler intrinsics are not significantly slower than inline assembly (I suspect I may even gain speed by avoiding unaligned memory accesses), and much more portable, so I am attempting to migrate my SSE code to use the intrinsics in xmmintrin.h. The primary class affected is vector, which looks something like this:
#include "xmmintrin.h"
union vector {
__m128 simd;
float raw[4];
//some constructors
//a bunch of functions and operators
} __attribute__ ((aligned (16)));
I have read previously that the g++ compiler will automatically allocate structs along memory boundaries equal to that of the size of the largest member variable, but this does not seem to be occurring, and the aligned attribute isn't helping. My research indicates that this is likely because I am allocating a whole bunch of function-local vectors on the stack, and that alignment on the stack is not guaranteed in x86. Is there any way to force this alignment? I should mention that this is running under native x86 Linux on a 32-bit machine, not Cygwin. I intend to implement multithreading in this application further down the line, so declaring the offending vector instances to be static isn't an option. I'm willing to increase the size of my vector data structure, if needed.
The simplest way is std::aligned_storage, which takes alignment as a second parameter.
If you don't have it yet, you might want to check Boost's version.
Then you can build your union:
#include <type_traits>

union vector {
    __m128 simd;
    std::aligned_storage<16, 16>::type alignment_only;
};
Finally, if it does not work, you can always create your own little class:
#include <cstdint>   // for intptr_t

template <typename Type, intptr_t Align> // Align must be a power of 2
class RawStorage
{
public:
    Type* operator->() {
        return reinterpret_cast<Type*>(aligned());
    }

    Type const* operator->() const {
        return reinterpret_cast<Type const*>(aligned());
    }

    Type& operator*() { return *(operator->()); }
    Type const& operator*() const { return *(operator->()); }

private:
    unsigned char* aligned() {
        intptr_t p = reinterpret_cast<intptr_t>(data);
        return reinterpret_cast<unsigned char*>((p + Align - 1) & ~(Align - 1));
    }

    unsigned char const* aligned() const {
        intptr_t p = reinterpret_cast<intptr_t>(data);
        return reinterpret_cast<unsigned char const*>((p + Align - 1) & ~(Align - 1));
    }

    unsigned char data[sizeof(Type) + Align - 1];
};
It will allocate a bit more storage than necessary, but this way alignment is guaranteed.
int main(int argc, char* argv[])
{
    RawStorage<__m128, 16> simd;
    *simd = /* ... */;
    return 0;
}
With luck, the compiler might be able to optimize away the pointer-alignment code if it can prove that the storage is already correctly aligned.
A few weeks ago, I re-wrote an old ray-tracing assignment from my university days, updating it to run on 64-bit Linux and to make use of the SIMD instructions. (The old version incidentally ran under DOS on a 486, to give you an idea of when I last did anything with it.)
There very well may be better ways of doing it, but here is what I did ...
typedef float v4f_t __attribute__((vector_size (16)));
class Vector {
...
union {
v4f_t simd;
float f[4];
} __attribute__ ((aligned (16)));
...
};
Disassembling my compiled binary showed that it was indeed making use of the movaps instruction.
Hope this helps.
Normally all you should need is:
union vector {
__m128 simd;
float raw[4];
};
i.e. no additional __attribute__ ((aligned (16))) required for the union itself.
This works as expected on pretty much every compiler I've ever used, with the notable exception of gcc 2.95.2 back in the day, which used to screw up stack alignment in some cases.
I use this union trick all the time with __m128 and it works with GCC on Mac and Visual C++ on Windows, so this must be a bug in the compiler that you use.
The other answers contain good workarounds though.
If you need an array of N of these objects, allocate vector raw[N+1], and use vector* const array = reinterpret_cast<vector*>(reinterpret_cast<intptr_t>(raw+1) & ~15) as the base address of your array. This will always be aligned.
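Spelled out as a compilable fragment (the same trick as above; N is a hypothetical array length):

#include <cstdint>

const int N = 100;                        // hypothetical array length
vector raw[N + 1];                        // one extra element of slack
vector* const array = reinterpret_cast<vector*>(
    reinterpret_cast<std::intptr_t>(raw + 1) & ~std::intptr_t(15));
// array[0..N-1] lies wholly inside raw[] and starts on a 16-byte boundary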
Is it worthwhile using C's bit-field implementation? If so, when is it ever used?
I was looking through some emulator code and it looks like the registers for the chips are not being implemented using bit fields.
Is this something that is avoided for performance reasons (or some other reason)?
Are there still times when bit-fields are used? (i.e. firmware to put on actual chips, etc.)
Bit-fields are typically only used when there's a need to map structure fields to specific bit slices, where some hardware will be interpreting the raw bits. An example might be assembling an IP packet header. I can't see a compelling reason for an emulator to model a register using bit-fields, as it's never going to touch real hardware!
Whilst bit-fields can lead to neat syntax, they're pretty platform-dependent, and therefore non-portable. A more portable, but yet more verbose, approach is to use direct bitwise manipulation, using shifts and bit-masks.
If you use bit-fields for anything other than assembling (or disassembling) structures at some physical interface, performance may suffer. This is because every time you read or write from a bit-field, the compiler will have to generate code to do the masking and shifting, which will burn cycles.
One use for bitfields which hasn't yet been mentioned is that unsigned bitfields provide arithmetic modulo a power-of-two "for free". For example, given:
struct { unsigned x:10; } foo;
arithmetic on foo.x will be performed modulo 2^10 = 1024.
(The same can be achieved directly by using bitwise & operations, of course - but sometimes it might lead to clearer code to have the compiler do it for you).
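For example (a hypothetical snippet):

struct { unsigned x : 10; } foo = { 1023 };

foo.x += 1;   // wraps around: foo.x is now 0 (arithmetic modulo 2^10)
foo.x -= 1;   // back to 1023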
FWIW, and looking only at the relative performance question - a bodgy benchmark:
#include <time.h>
#include <cstdlib>   // for exit()
#include <iostream>
struct A
{
void a(unsigned n) { a_ = n; }
void b(unsigned n) { b_ = n; }
void c(unsigned n) { c_ = n; }
void d(unsigned n) { d_ = n; }
unsigned a() { return a_; }
unsigned b() { return b_; }
unsigned c() { return c_; }
unsigned d() { return d_; }
volatile unsigned a_:1,
b_:5,
c_:2,
d_:8;
};
struct B
{
void a(unsigned n) { a_ = n; }
void b(unsigned n) { b_ = n; }
void c(unsigned n) { c_ = n; }
void d(unsigned n) { d_ = n; }
unsigned a() { return a_; }
unsigned b() { return b_; }
unsigned c() { return c_; }
unsigned d() { return d_; }
volatile unsigned a_, b_, c_, d_;
};
struct C
{
void a(unsigned n) { x_ &= ~0x01; x_ |= n; }
void b(unsigned n) { x_ &= ~0x3E; x_ |= n << 1; }
void c(unsigned n) { x_ &= ~0xC0; x_ |= n << 6; }
void d(unsigned n) { x_ &= ~0xFF00; x_ |= n << 8; }
unsigned a() const { return x_ & 0x01; }
unsigned b() const { return (x_ & 0x3E) >> 1; }
unsigned c() const { return (x_ & 0xC0) >> 6; }
unsigned d() const { return (x_ & 0xFF00) >> 8; }
volatile unsigned x_;
};
struct Timer
{
Timer() { get(&start_tp); }
double elapsed() const {
struct timespec end_tp;
get(&end_tp);
return (end_tp.tv_sec - start_tp.tv_sec) +
(1E-9 * end_tp.tv_nsec - 1E-9 * start_tp.tv_nsec);
}
private:
static void get(struct timespec* p_tp) {
if (clock_gettime(CLOCK_REALTIME, p_tp) != 0)
{
std::cerr << "clock_gettime() error\n";
exit(EXIT_FAILURE);
}
}
struct timespec start_tp;
};
template <typename T>
unsigned f()
{
int n = 0;
Timer timer;
T t;
for (int i = 0; i < 10000000; ++i)
{
t.a(i & 0x01);
t.b(i & 0x1F);
t.c(i & 0x03);
t.d(i & 0xFF);
n += t.a() + t.b() + t.c() + t.d();
}
std::cout << timer.elapsed() << '\n';
return n;
}
int main()
{
std::cout << "bitfields: " << f<A>() << '\n';
std::cout << "separate ints: " << f<B>() << '\n';
std::cout << "explicit and/or/shift: " << f<C>() << '\n';
}
Output on my test machine (numbers vary by ~20% run to run):
bitfields: 0.140586
1449991808
separate ints: 0.039374
1449991808
explicit and/or/shift: 0.252723
1449991808
This suggests that, with g++ -O3 on a pretty recent Athlon, bitfields are a few times slower than separate ints, and this particular and/or/shift implementation is at least twice as bad again ("worse", in fact, since other operations like memory reads/writes are emphasised by the volatility above, and there's loop overhead etc., so the differences are understated in the results).
If you're dealing in hundreds of megabytes of structs that can be mainly bitfields or mainly distinct ints, the caching issues may become dominant - so benchmark in your system.
update from 2021 with an AMD Ryzen 9 3900X and -O2 -march=native:
bitfields: 0.0224893
1449991808
separate ints: 0.0288447
1449991808
explicit and/or/shift: 0.0190325
1449991808
Here we see that everything has changed massively; the main implication being: benchmark with the systems you care about.
UPDATE: user2188211 attempted an edit which was rejected but usefully illustrated how bitfields become faster as the amount of data increases: "when iterating over a vector of a few million elements in [a modified version of] the above code, such that the variables do not reside in cache or registers, the bitfield code may be the fastest."
template <typename T>
unsigned f()
{
int n = 0;
Timer timer;
std::vector<T> ts(1024 * 1024 * 16);
for (size_t i = 0, idx = 0; i < 10000000; ++i)
{
T& t = ts[idx];
t.a(i & 0x01);
t.b(i & 0x1F);
t.c(i & 0x03);
t.d(i & 0xFF);
n += t.a() + t.b() + t.c() + t.d();
idx++;
if (idx >= ts.size()) {
idx = 0;
}
}
std::cout << timer.elapsed() << '\n';
return n;
}
Results from an example run (g++ -O3, Core2Duo):
0.19016
bitfields: 1449991808
0.342756
separate ints: 1449991808
0.215243
explicit and/or/shift: 1449991808
Of course, timing's all relative and which way you implement these fields may not matter at all in the context of your system.
I've seen/used bit fields in two situations: computer games and hardware interfaces. The hardware use is pretty straightforward: the hardware expects data in a certain bit format, which you can define either manually or through pre-defined library structures. It depends on the specific library whether it uses bit fields or just bit manipulation.
In the "old days" computer games used bit fields frequently to make the most of computer/disk memory. For example, for an NPC definition in an RPG you might find (made-up example):
struct charinfo_t
{
unsigned int Strength : 7; // 0-100
unsigned int Agility : 7;
unsigned int Endurance: 7;
unsigned int Speed : 7;
unsigned int Charisma : 7;
unsigned int HitPoints : 10; //0-1000
unsigned int MaxHitPoints : 10;
//etc...
};
You don't see it so much in more modern games/software, as the space savings have become proportionally less important as computers gain more memory. Saving 1MB of memory when your computer only has 16MB is a big deal, but not so much when you have 4GB.
The primary purpose of bit-fields is to provide a way to save memory in massively instantiated aggregate data structures by achieving tighter packing of data.
The whole idea is to take advantage of situations where you have several fields in some struct type which don't need the entire width (and range) of some standard data type. This provides you with the opportunity to pack several such fields into one allocation unit, thus reducing the overall size of the struct type. An extreme example would be boolean fields, which can be represented by individual bits (with, say, 32 of them packable into a single unsigned int allocation unit).
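For instance (an illustrative sketch; the sizes are typical rather than guaranteed):

struct PackedFlags {
    unsigned f0 : 1, f1 : 1, f2 : 1, f3 : 1;   // ... up to 32 one-bit fields
};
// sizeof(PackedFlags) is typically 4: all the flags share one unsigned int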
Obviously, this only makes sense in situations where the pros of the reduced memory consumption outweigh the cons of slower access to values stored in bit-fields. However, such situations arise quite often, which makes bit-fields an absolutely indispensable language feature. This should answer your question about the modern use of bit-fields: not only are they used, they are essentially mandatory in any practically meaningful code oriented on processing large amounts of homogeneous data (like large graphs, for example), because their memory-saving benefits greatly outweigh any individual-access performance penalties.
In a way, bit-fields in their purpose are very similar to "small" arithmetic types: signed/unsigned char, short, float. In actual data-processing code one would not normally use any types smaller than int or double (with few exceptions). Arithmetic types like signed/unsigned char, short, and float exist to serve as "storage" types: as memory-saving compact members of struct types in situations where their range (or precision) is known to be sufficient. Bit-fields are just another step in the same direction, trading a bit more performance for much greater memory-saving benefits.
So, that gives us a rather clear set of conditions under which it is worthwhile to employ bit-fields:
Struct type contains multiple fields that can be packed into a smaller number of bits.
The program instantiates a large number of objects of that struct type.
If the conditions are met, you declare all bit-packable fields contiguously (typically at the end of the struct type) and assign them their appropriate bit-widths (and, usually, take some steps to ensure that the bit-widths are appropriate). In most cases it makes sense to play around with the ordering of these fields to achieve the best packing and/or performance, as sketched below.
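For instance, a made-up illustration of those two conditions:

// Hypothetical example: many instances, packable fields grouped at the end.
struct GraphEdge {
    void*    target;        // full-width member first
    unsigned weight  : 12;  // known to fit in 12 bits
    unsigned color   : 3;
    unsigned visited : 1;   // the three fields pack into one allocation unit
};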
There's also a weird secondary use of bit-fields: using them for mapping bit groups in various externally-specified representations, like hardware registers, floating-point formats, file formats, etc. This has never been intended as a proper use of bit-fields, even though for some unexplained reason this kind of bit-field abuse continues to pop up in real-life code. Just don't do this.
One use for bit fields used to be mirroring hardware registers when writing embedded code. However, since the bit order is platform-dependent, they don't work if the hardware orders its bits differently from the processor. That said, I can't think of a use for bit fields any more. You're better off implementing a bit-manipulation library that can be ported across platforms.
Bit fields were used in the olden days to save program memory.
They degrade performance because registers cannot work with them, so they have to be converted to integers to do anything with them. They tend to lead to more complex code that is unportable and harder to understand (since you have to mask and unmask things all the time to actually use the values).
Check out the source for http://www.nethack.org/ to see pre-ANSI C in all its bitfield glory!
In the 70s I used bit fields to control hardware on a TRS-80. The display, keyboard, cassette, and disks were all memory-mapped devices. Individual bits controlled various things.
A bit controlled 32 column vs 64 column display.
Bit 0 in that same memory cell was the cassette serial data in/out.
As I recall, the disk drive control had a number of them; there were 4 bytes in total, and I think there was a 2-bit drive select. But it was a long time ago. It was kind of impressive back then that there were at least two different C compilers for the platform.
The other observation is that bit fields really are platform specific. There is no expectation that a program with bit fields should port to another platform.
In modern code, there's really only one reason to use bitfields: to control the space requirements of a bool or an enum type, within a struct/class. For instance (C++):
enum token_code { TK_a, TK_b, TK_c, ... /* less than 255 codes */ };

struct token {
    token_code code : 8;
    bool number_unsigned : 1;
    bool is_keyword : 1;
    /* etc */
};
IMO there's basically no reason not to use :1 bitfields for bool, as modern compilers will generate very efficient code for it. In C, though, make sure your bool typedef is either the C99 _Bool or failing that an unsigned int, because a signed 1-bit field can hold only the values 0 and -1 (unless you somehow have a non-twos-complement machine).
With enumeration types, always use a size that corresponds to the size of one of the primitive integer types (8/16/32/64 bits, on normal CPUs) to avoid inefficient code generation (repeated read-modify-write cycles, usually).
Using bitfields to line up a structure with some externally-defined data format (packet headers, memory-mapped I/O registers) is commonly suggested, but I actually consider it a bad practice, because C doesn't give you enough control over endianness, padding, and (for I/O regs) exactly what assembly sequences get emitted. Have a look at Ada's representation clauses sometime if you want to see how much C is missing in this area.
Boost.Thread uses bitfields in its shared_mutex, on Windows at least:
struct state_data
{
    unsigned shared_count:11,
             shared_waiting:11,
             exclusive:1,
             upgrade:1,
             exclusive_waiting:7,
             exclusive_waiting_blocked:1;
};
An alternative to consider is to specify bit field structures with a dummy structure (never instantiated) where each byte represents a bit:
struct Bf_format
{
    char field1[5];
    char field2[9];
    char field3[18];
};
With this approach, sizeof gives the width of the bit field (in bits), and offsetof gives the offset of the bit field (in bits). At least in the case of GNU gcc, compiler optimization of bit-wise operations (with constant shifts and masks) seems to have reached rough parity with (base-language) bit fields.
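For example, assuming the Bf_format dummy above (sizeof on a member and offsetof inside static_assert both work at least with GNU g++; the results come out in bits, not bytes):

#include <cstddef>   // offsetof

static_assert(sizeof(Bf_format::field2) == 9, "field2 is 9 bits wide");
static_assert(offsetof(Bf_format, field2) == 5, "field2 starts at bit 5");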
I have written a C++ header file (using this approach) which allows structures of bit fields to be defined and used in a performant, much more portable, much more flexible way: https://github.com/wkaras/C-plus-plus-library-bit-fields . So, unless you are stuck using C, I think there would rarely be a good reason to use the base-language facility for bit fields.