WinAPI _Interlocked* intrinsic functions for char, short - c++

I need to use an _Interlocked* function on a char or short, but these functions take a long pointer as input. There seems to be a function _InterlockedExchange8, but I don't see any documentation for it; it looks like an undocumented feature. The compiler also wasn't able to find an _InterlockedAdd8 function.
I would appreciate any information on these functions, recommendations for or against using them, and other solutions as well.
update 1
I'll try to simplify the question.
How can I make this work?
struct X
{
char data;
};
X atomic_exchange(X another)
{
return _InterlockedExchange( ??? );
}
I see two possible solutions
Use _InterlockedExchange8
Cast another to long, do exchange and cast result back to X
The first one is obviously a bad solution.
The second one looks better, but how do I implement it?
update 2
What do you think about something like this?
template <typename T, typename U>
class padded_variable
{
public:
padded_variable(T v): var(v) {}
padded_variable(U v): var(*static_cast<T*>(static_cast<void*>(&v))) {}
U& cast()
{
return *static_cast<U*>(static_cast<void*>(&var));
}
T& get()
{
return var;
}
private:
T var;
char padding[sizeof(U) - sizeof(T)];
};
struct X
{
char data;
};
template <typename T, int S = sizeof(T)> class var;
template <typename T> class var<T, 1>
{
public:
var(): data(T()) {}
T atomic_exchange(T another)
{
padded_variable<T, long> xch(another);
padded_variable<T, long> res(_InterlockedExchange(&data.cast(), xch.cast()));
return res.get();
}
private:
padded_variable<T, long> data;
};
Thanks.

It's pretty easy to make 8-bit and 16-bit interlocked functions but the reason they're not included in WinAPI is due to IA64 portability. If you want to support Win64 the assembler cannot be inline as MSVC no longer supports it. As external function units, using MASM64, they will not be as fast as inline code or intrinsics so you are wiser to investigate promoting algorithms to use 32-bit and 64-bit atomic operations instead.
Example interlocked API implementation: intrin.asm
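One way to get a byte-wide exchange without assembler is a compare-and-swap loop on the containing 32-bit word. Here is a sketch of that technique, using std::atomic<std::uint32_t> as a portable stand-in for a volatile long* and _InterlockedCompareExchange so it can be shown and tested outside MSVC; the function name and byte-indexing scheme are illustrative:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomically exchange one byte inside a 32-bit word by retrying a CAS on the
// whole word until no other thread has modified it in between.
inline char exchange_byte(std::atomic<std::uint32_t>& word,
                          unsigned byte_index, char desired)
{
    assert(byte_index < 4);
    const unsigned shift = byte_index * 8;
    const std::uint32_t mask = 0xFFu << shift;
    std::uint32_t oldw = word.load();
    for (;;) {
        // Splice the new byte into the most recently observed word value.
        std::uint32_t neww = (oldw & ~mask)
            | (static_cast<std::uint32_t>(static_cast<unsigned char>(desired)) << shift);
        // On failure compare_exchange_weak refreshes oldw, so the loop
        // retries with the word's current contents.
        if (word.compare_exchange_weak(oldw, neww))
            return static_cast<char>((oldw & mask) >> shift);
    }
}
```

With the real intrinsic the loop is the same shape: read the containing long, splice in the new byte, and retry _InterlockedCompareExchange until it succeeds.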

Why do you want to use smaller data types? So you can fit a bunch of them in a small memory space? That's just going to lead to false sharing and cache line contention.
Whether you use locking or lockless algorithms, it's ideal to have your data in blocks of at least 128 bytes (or whatever the cache line size is on your CPU) that are only used by a single thread at a time.

Well, you have to make do with the functions available. _InterlockedIncrement and _InterlockedCompareExchange are available in 16-bit and 32-bit variants (the latter in a 64-bit variant as well), and maybe a few other interlocked intrinsics have 16-bit versions too, but _InterlockedAdd doesn't seem to, and there appear to be no byte-sized interlocked intrinsics/functions at all.
So... You need to take a step back and figure out how to solve your problem without an IntrinsicAdd8.
Why are you working with individual bytes in any case? Stick to int-sized objects unless you have a really good reason to use something smaller.

Creating a new answer because your edit changed things a bit:
Use _InterlockedExchange8
Cast another to long, do exchange and cast result back to X
The first simply won't work. Even if the function existed, it would only let you atomically update one byte at a time, which means the object as a whole would be updated in a series of steps that, taken together, wouldn't be atomic.
The second doesn't work either, unless X is a POD type of the same size as a long and aligned on a sizeof(long) boundary.
In order to solve this problem you need to narrow down what types X might be. First, of course, is it guaranteed to be a POD type? If not, you have an entirely different problem, as you can't safely treat non-POD types as raw memory bytes.
Second, what sizes may X have? The Interlocked functions can handle 16, 32 and, depending on circumstances, maybe 64 or even 128 bit widths.
Does that cover all the cases you can encounter?
If not, you may have to abandon these atomic operations, and settle for plain old locks. Lock a Mutex to ensure that only one thread touches these objects at a time.


Can I check a small array of bools in one go?

There was a similar question here, but the user in that question seemed to have a much larger array, or vector. If I have:
bool boolArray[4];
And I want to check if all elements are false. I can check [0], [1], [2] and [3] either separately, or I can loop through it. Since (as far as I know) false should have the value 0 and anything other than 0 is true, I thought about simply doing:
if ( *(int*) boolArray) { }
This works, but I realize that it relies on bool being one byte and int being four bytes. If I cast to (std::uint32_t) would it be OK, or is it still a bad idea? I just happen to have 3 or 4 bools in an array and was wondering if this is safe, and if not if there is a better way to do it.
Also, in the case I end up with more than 4 bools but less than 8 can I do the same thing with a std::uint64_t or unsigned long long or something?
As πάντα ῥεῖ noticed in comments, std::bitset is probably the best way to deal with that in UB-free manner.
std::bitset<4> boolArray {};
if(boolArray.any()) {
//do the thing
}
If you want to stick to arrays, you could use std::any_of, but this requires a (possibly peculiar-looking) functor which just returns its argument:
bool boolArray[4];
if(std::any_of(std::begin(boolArray), std::end(boolArray), [](bool b){return b;})) {
//do the thing
}
Type-punning 4 bools to int might be a bad idea - you cannot be sure of the size of each of the types. It probably will work on most architectures, but std::bitset is guaranteed to work everywhere, under any circumstances.
Several answers have already explained good alternatives, particularly std::bitset and std::any_of(). I am writing separately to point out that, unless you know something we don't, it is not safe to type pun between bool and int in this fashion, for several reasons:
int might not be four bytes, as multiple answers have pointed out.
M.M points out in the comments that bool might not be one byte. I'm not aware of any real-world architectures in which this has ever been the case, but it is nevertheless spec-legal. It (probably) can't be smaller than a byte unless the compiler is doing some very elaborate hide-the-ball chicanery with its memory model, and a multi-byte bool seems rather useless. Note however that a byte need not be 8 bits in the first place.
int can have trap representations. That is, it is legal for certain bit patterns to cause undefined behavior when they are cast to int. This is rare on modern architectures, but might arise on (for example) ia64, or any system with signed zeros.
Regardless of whether you have to worry about any of the above, your code violates the strict aliasing rule, so compilers are free to "optimize" it under the assumption that the bools and the int are entirely separate objects with non-overlapping lifetimes. For example, the compiler might decide that the code which initializes the bool array is a dead store and eliminate it, because the bools "must have" ceased to exist* at some point before you dereferenced the pointer. More complicated situations can also arise relating to register reuse and load/store reordering. All of these infelicities are expressly permitted by the C++ standard, which says the behavior is undefined when you engage in this kind of type punning.
You should use one of the alternative solutions provided by the other answers.
* It is legal (with some qualifications, particularly regarding alignment) to reuse the memory pointed to by boolArray by casting it to int and storing an integer, although if you actually want to do this, you must then pass boolArray through std::launder if you want to read the resulting int later. Regardless, the compiler is entitled to assume that you have done this once it sees the read, even if you don't call launder.
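For completeness, if you really do want to look at the object representation rather than the values, std::memcpy into an integer is the sanctioned route. A sketch follows; it still assumes sizeof(bool) == 1 and an all-zero representation for false (true on mainstream ABIs, but not guaranteed by the standard), yet it avoids the strict-aliasing violation entirely:

```cpp
#include <cstring>
#include <cstdint>

// Copy the four bools' object representation into a uint32_t and test it.
// memcpy is the standard-blessed way to inspect an object's bytes, so no
// aliasing UB occurs; only the value-representation assumption remains.
inline bool any_set(const bool (&arr)[4])
{
    static_assert(sizeof(arr) == sizeof(std::uint32_t), "size mismatch");
    std::uint32_t word;
    std::memcpy(&word, arr, sizeof word);
    return word != 0;
}
```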
You can use std::bitset<N>::any:
any returns true if any of the bits are set to true, otherwise false.
#include <iostream>
#include <bitset>
int main ()
{
std::bitset<4> foo;
// modify foo here
if (foo.any())
std::cout << foo << " has " << foo.count() << " bits set.\n";
else
std::cout << foo << " has no bits set.\n";
return 0;
}
If you want to return true if all or none of the bits set to on, you can use std::bitset<N>::all or std::bitset<N>::none respectively.
The standard library has what you need in the form of the std::all_of, std::any_of, std::none_of algorithms.
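For the question as actually asked ("are all elements false?"), std::none_of maps most directly. A minimal sketch; the helper name is illustrative:

```cpp
#include <algorithm>
#include <iterator>

// True iff no element of the range is true, i.e. all elements are false.
inline bool all_false(const bool* first, const bool* last)
{
    return std::none_of(first, last, [](bool b) { return b; });
}
```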
...And for the obligatory "roll your own" answer, we can provide a simple "or"-like function for any array bool[N], like so:
template<size_t N>
constexpr bool or_all(const bool (&bs)[N]) {
for (bool b : bs) {
if (b) { return b; }
}
return false;
}
Or more concisely,
template<size_t N>
constexpr bool or_all(const bool (&bs)[N]) {
for (bool b : bs) { if (b) { return b; } }
return false;
}
This also has the benefit of both short-circuiting like ||, and being optimised out entirely if calculable at compile time.
Apart from that, if you want to examine the original idea of type-punning bool[N] to some other type to simplify observation, I would very much recommend that you don't do that; view it as char[N2] instead, where N2 == (sizeof(bool) * N). This would allow you to provide a simple representation viewer that can automatically scale to the viewed object's actual size, allows iteration over its individual bytes, and makes it easier to determine whether the representation matches specific values (such as, e.g., zero or non-zero). I'm not entirely sure off the top of my head whether such examination would invoke any UB, but I can say for certain that any such type's construction cannot be a viable constant expression, due to requiring a reinterpret_cast to char* or unsigned char* or similar (either explicitly, or inside std::memcpy()), and thus couldn't as easily be optimised out.

Filling an std::vector with raw data

I need to fill a vector with raw data, sometimes 2 bytes, sometimes 8... I ended up with this template function:
template <typename T>
void fillVector(std::vector<uint8_t>& dest, T t)
{
auto ptr = reinterpret_cast<uint8_t*>(&t);
dest.insert(dest.end(),ptr,ptr+sizeof(t));
}
with this I can fill the vector like this:
fillVector<uint32_t>(dst,32bitdata);
fillVector<uint16_t>(dst,16bitdata);
I was wondering if something more similar already exist in the standard library?
No, there's nothing in the standard library to achieve what you are after. So your solution is pretty much what you can currently go with (assuming your goal is to do some form of serialization).
The only point of improvement is that you are assuming uint8_t is a type that may be used to alias an object and inspect its bytes. That need not be the case. The only such types in C++11 are char and unsigned char. While uint8_t usually aliases the latter on most modern architectures, that's not a hard requirement; it could alias a platform-specific 8-bit unsigned integer type (the merits of that are outside the scope of this question). So, to be standard-conforming, either guard against it:
static_assert(std::is_same<unsigned char, std::uint8_t>::value, "Oops!");
Or use your own alias for valid "byte" type
namespace myapp { using byte = unsigned char; }
and deal in std::vector<myapp::byte>.
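As a further step, the question's fillVector can be phrased with std::memcpy, which sidesteps the aliasing concern on the read side as well; a sketch (the function name is the question's own, the memcpy formulation is the only change):

```cpp
#include <cstring>
#include <cstdint>
#include <vector>

// Append the raw bytes of t to dest: grow the vector, then memcpy the
// object representation into the newly created space.
template <typename T>
void fillVector(std::vector<unsigned char>& dest, T t)
{
    const std::size_t old_size = dest.size();
    dest.resize(old_size + sizeof(T));
    std::memcpy(dest.data() + old_size, &t, sizeof(T));
}
```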

Portable bit fields for Handles

I want to use and store "Handles" to data in an object buffer to reduce allocation overhead. The handle is simply an index into an array with the object. However I need to detect use-after-reallocations, as this could slip in quite easily. The common approach seems to be using bit fields. However this leads to 2 problems:
Bit fields are implementation defined
Bit shifting is not portable across big/little endian machines.
What I need:
Store handle to file (file handler can manage either integer types (byte swapping) or byte arrays)
Store 2 values in the handle with minimum space
What I got:
template<class T_HandleDef, typename T_Storage = uint32_t>
struct Handle
{
typedef T_HandleDef HandleDef;
typedef T_Storage Storage;
Handle(): handle_(0){}
private:
const T_Storage handle_;
};
template<unsigned T_numIndexBits = 16, typename T_Tag = void>
struct HandleDef{
static const unsigned numIndexBits = T_numIndexBits;
};
template<class T_Handle>
struct HandleAccessor{
typedef typename T_Handle::Storage Storage;
typedef typename T_Handle::HandleDef HandleDef;
static const unsigned numIndexBits = HandleDef::numIndexBits;
static const unsigned numMagicBits = sizeof(Storage) * 8 - numIndexBits;
/// "Magic" struct that splits the handle into values
union HandleData{
struct
{
Storage index : numIndexBits;
Storage magic : numMagicBits;
};
T_Handle handle;
};
};
A usage would be for example:
typedef Handle<HandleDef<24> > FooHandle;
FooHandle Create(unsigned idx, unsigned m){
HandleAccessor<FooHandle>::HandleData data;
data.index = idx;
data.magic = m;
return data.handle;
}
My goal was to keep the handle as opaque as possible, add a bool check but nothing else. Users of the handle should not be able to do anything with it but passing it around.
So problems I run into:
Union is UB -> Replace its T_Handle by Storage and add a ctor to Handle from Storage
How does the compiler layout the bit field? I fill the whole union/type so there should be no padding. So probably the only thing that can be different is which type comes first depending on endianess, correct?
How can I store handle_ to a file and load it on a machine with possibly different endianness, and still have index and magic be correct? I think I can store the containing Storage 'endian-correct' and get correct values IF both members occupy exactly half the space (2 shorts in a uint), but I always want more space for the index than for the magic value.
Note: There are already questions about bitfields and unions. Summary:
Bitfields may have unexpected padding (impossible here as whole type occupied)
Order of "members" depends on the compiler (only 2 possible ways here; it should be safe to assume the order depends entirely on endianness, though this may or may not actually help here)
A specific binary layout of bits can be achieved by manual shifting (or e.g. wrappers, http://blog.codef00.com/2014/12/06/portable-bitfields-using-c11/) -> Not an answer here: I also need a specific layout of the values IN the bitfield. So I'm not sure what I'd get if I e.g. create a handle as handle = (magic << numIndexBits) | index and save/load this as binary (no endianness conversion). I'm missing a big-endian machine for testing.
Note: No C++11, but boost is allowed.
The answer is pretty simple (based on another question I forgot the link to, and comments by @Jeremy Friesner):
As "numbers" are already an abstraction in C++, one can be sure to always have the same bit representation when the variable is in a CPU register (i.e. whenever it is used for calculations). Bit shifts in C++ are also defined in an endian-independent way: x << 1 always equals x * 2, regardless of the underlying byte order.
One only gets endianness problems when saving to a file, sending/receiving over a network, or accessing the value through memory differently (e.g. via pointers...)
One cannot use C++ bitfields here, as one cannot be 100% sure about the order of the "entries". Bitfield containers might be OK if they allow access to the data as a "number".
Safest is (still) using bit shifts, which are very simple in this case (only 2 values). During storing/serialization the number must then be stored in an endian-agnostic way.
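The shift-based approach can be sketched as follows. The 24/8 bit split and the names are illustrative; the serialization always writes the least significant byte first, so the file format is fixed regardless of the host's endianness:

```cpp
#include <cstdint>

// Pack a 24-bit index and an 8-bit magic value into one uint32_t handle.
const unsigned kIndexBits = 24;
const std::uint32_t kIndexMask = (1u << kIndexBits) - 1;

inline std::uint32_t pack_handle(std::uint32_t index, std::uint32_t magic)
{
    return (magic << kIndexBits) | (index & kIndexMask);
}
inline std::uint32_t handle_index(std::uint32_t h) { return h & kIndexMask; }
inline std::uint32_t handle_magic(std::uint32_t h) { return h >> kIndexBits; }

// Endian-agnostic serialization: always emit the least significant byte
// first, reassemble with shifts. No byte-swapping ifdefs needed.
inline void store_le(std::uint32_t h, unsigned char out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = static_cast<unsigned char>((h >> (8 * i)) & 0xFF);
}
inline std::uint32_t load_le(const unsigned char in[4])
{
    std::uint32_t h = 0;
    for (int i = 0; i < 4; ++i)
        h |= static_cast<std::uint32_t>(in[i]) << (8 * i);
    return h;
}
```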

an optimized memcpy for small, or fixed size data in gcc

I use memcpy to copy both variable sizes of data and fixed sized data. In some cases I copy small amounts of memory (only a handful of bytes). In GCC I recall that memcpy used to be an intrinsic/builtin. Profiling my code however (with valgrind) I see thousands of calls to the actual "memcpy" function in glibc.
What conditions have to be met to use the builtin function? I can roll my own memcpy quickly, but I'm sure the builtin is more efficient than what I can do.
NOTE: In most cases the amount of data to be copied is available as a compile-time constant.
CXXFLAGS: -O3 -DNDEBUG
The code I'm using now forces builtins; if you take off the __builtin_ prefix, the builtin is not used. This is called from various other templates/functions using T=sizeof(type). The sizes that get used are 1, 2, multiples of 4, a few 50-100 byte sizes, and some larger structures.
template<int T>
inline void load_binary_fixm(void *address)
{
if( (at + T) > len )
stream_error();
__builtin_memcpy( address, data + at, T );
at += T;
}
For the cases where T is small, I'd specialise and use a native assignment.
For example, where T is 1, just assign a single char.
If you know the addresses are aligned, use an appropriately sized int type for your platform.
If the addresses are not aligned, you might be better off doing the appropriate number of char assignments.
The point of this is to avoid branching and keeping a counter.
Where T is big, I'd be surprised if you do better than the library memcpy(), and the function call overhead is probably going to be lost in the noise. If you do want to optimise, look at the various memcpy() implementations out there. There are variants that use extended instructions, etc.
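The specialisation idea can be sketched like this; the names are illustrative, and the memcpy calls with a constant size compile down to single moves on mainstream compilers, so no alignment assumptions are needed:

```cpp
#include <cstring>
#include <cstdint>

// Primary template: fall back to memcpy for arbitrary fixed sizes.
template <int N>
inline void copy_fixed(void* dst, const void* src)
{
    std::memcpy(dst, src, N);
}

// Single-byte case: a plain char assignment, no call, no counter.
template <>
inline void copy_fixed<1>(void* dst, const void* src)
{
    *static_cast<char*>(dst) = *static_cast<const char*>(src);
}

// Four-byte case: bounce through a uint32_t. The constant-size memcpy
// compiles to one 32-bit load/store, without the alignment assumptions a
// direct uint32_t* cast would make.
template <>
inline void copy_fixed<4>(void* dst, const void* src)
{
    std::uint32_t tmp;
    std::memcpy(&tmp, src, 4);
    std::memcpy(dst, &tmp, 4);
}
```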
Update:
Looking at your actual(!) question about inlining memcpy, questions like compiler versions and platform become relevant. Out of curiosity, have you tried using std::copy, something like this:
template<int T>
inline void load_binary_fixm(void *address)
{
if( (at + T) > len )
stream_error();
std::copy(data + at, data + at + T, static_cast<char*>(address));
at += T;
}

How far to go with a strongly typed language?

Let's say I am writing an API, and one of my functions take a parameter that represents a channel, and will only ever be between the values 0 and 15. I could write it like this:
void Func(unsigned char channel)
{
if(channel > 15) // channel is unsigned, so no need to check < 0
{ /* throw some exception */ }
// do something
}
Or do I take advantage of C++ being a strongly typed language, and make myself a type:
class CChannel
{
public:
CChannel(unsigned char value) : m_Value(value)
{
if(value > 15)
{ /* throw some exception */ }
}
operator unsigned char() { return m_Value; }
private:
unsigned char m_Value;
};
My function now becomes this:
void Func(const CChannel &channel)
{
// No input checking required
// do something
}
But is this total overkill? I like the self-documentation and the guarantee it is what it says it is, but is it worth paying the construction and destruction of such an object, let alone all the additional typing? Please let me know your comments and alternatives.
If you wanted this simpler approach generalize it so you can get more use out of it, instead of tailor it to a specific thing. Then the question is not "should I make a entire new class for this specific thing?" but "should I use my utilities?"; the latter is always yes. And utilities are always helpful.
So make something like:
template <typename T>
void check_range(const T& pX, const T& pMin, const T& pMax)
{
if (pX < pMin || pX > pMax)
throw std::out_of_range("check_range failed"); // or something else
}
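A minimal usage sketch of that utility (check_range is repeated here so the snippet stands alone, and valid_channel is an illustrative helper, not part of the answer's API):

```cpp
#include <stdexcept>

// Throw if pX falls outside the inclusive range [pMin, pMax].
template <typename T>
void check_range(const T& pX, const T& pMin, const T& pMax)
{
    if (pX < pMin || pX > pMax)
        throw std::out_of_range("check_range failed");
}

// Example wrapper: report validity as a bool instead of an exception.
inline bool valid_channel(int c)
{
    try { check_range(c, 0, 15); return true; }
    catch (const std::out_of_range&) { return false; }
}
```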
Now you've already got this nice utility for checking ranges. Your code, even without the channel type, can already be made cleaner by using it. You can go further:
template <typename T, T Min, T Max>
class ranged_value
{
public:
typedef T value_type;
static const value_type minimum = Min;
static const value_type maximum = Max;
ranged_value(const value_type& pValue = value_type()) :
mValue(pValue)
{
check_range(mValue, minimum, maximum);
}
const value_type& value(void) const
{
return mValue;
}
// arguably dangerous
operator const value_type&(void) const
{
return mValue;
}
private:
value_type mValue;
};
Now you've got a nice utility, and can just do:
typedef ranged_value<unsigned char, 0, 15> channel;
void foo(const channel& pChannel);
And it's re-usable in other scenarios. Just stick it all in a "checked_ranges.hpp" file and use it whenever you need. It's never bad to make abstractions, and having utilities around isn't harmful.
Also, never worry about overhead. Creating a class simply consists of running the same code you would do anyway. Additionally, clean code is to be preferred over anything else; performance is a last concern. Once you're done, then you can get a profiler to measure (not guess) where the slow parts are.
Yes, the idea is worthwhile, but (IMO) writing a complete, separate class for each range of integers is kind of pointless. I've run into enough situations that call for limited range integers that I've written a template for the purpose:
template <class T, T lower, T upper>
class bounded {
T val;
void assure_range(T v) {
if ( v < lower || upper <= v)
throw std::range_error("Value out of range");
}
public:
bounded &operator=(T v) {
assure_range(v);
val = v;
return *this;
}
bounded(T const &v=T()) {
assure_range(v);
val = v;
}
operator T() { return val; }
};
Using it would be something like:
bounded<unsigned, 0, 16> channel;
Of course, you can get more elaborate than this, but this simple one still handles about 90% of situations pretty well.
No, it is not overkill - you should always try to represent abstractions as classes. There are a zillion reasons for doing this and the overhead is minimal. I would call the class Channel though, not CChannel.
Can't believe nobody mentioned enum's so far. Won't give you a bulletproof protection, but still better than a plain integer datatype.
Looks like overkill, especially the operator unsigned char() accessor. You're not encapsulating data; you're making evident things more complicated and, probably, more error-prone.
Data types like your Channel are usually a part of something more abstracted.
So, if you use that type in your ChannelSwitcher class, you could use commented typedef right in the ChannelSwitcher's body (and, probably, your typedef is going to be public).
// Currently used channel type
typedef unsigned char Channel;
Whether you throw an exception when constructing your "CChannel" object or at the entrance to the method that requires the constraint makes little difference. In either case you're making runtime assertions, which means the type system really isn't doing you any good, is it?
If you want to know how far you can go with a strongly typed language, the answer is "very far, but not with C++." The kind of power you need to statically enforce a constraint like, "this method may only be invoked with a number between 0 and 15" requires something called dependent types--that is, types which depend on values.
To put the concept into pseudo-C++ syntax (pretending C++ had dependent types), you might write this:
void Func(unsigned char channel, IsBetween<0, channel, 15> proof) {
...
}
Note that IsBetween is parameterized by values rather than types. In order to call this function in your program now, you must provide to the compiler the second argument, proof, which must have the type IsBetween<0, channel, 15>. Which is to say, you have to prove at compile-time that channel is between 0 and 15! This idea of types which represent propositions, whose values are proofs of those propositions, is called the Curry-Howard Correspondence.
Of course, proving such things can be difficult. Depending on your problem domain, the cost/benefit ratio can easily tip in favor of just slapping runtime checks on your code.
Whether something is overkill or not often depends on lots of different factors. What might be overkill in one situation might not in another.
This case might not be overkill if you had lots of different functions that all accepted channels and all had to do the same range checking. The Channel class would avoid code duplication, and also improve readability of the functions (as would naming the class Channel instead of CChannel - Neil B. is right).
Sometimes when the range is small enough I will instead define an enum for the input.
If you add constants for the 16 different channels, and also a static method that fetches the channel for a given value (or throws an exception if out of range) then this can work without any additional overhead of object creation per method call.
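That enum-plus-factory idea might look like the following sketch; all names here are illustrative, not from the original post:

```cpp
#include <stdexcept>

// The enum documents the valid range at every call site.
enum Channel {
    Channel0 = 0,  Channel1,  Channel2,  Channel3,
    Channel4,      Channel5,  Channel6,  Channel7,
    Channel8,      Channel9,  Channel10, Channel11,
    Channel12,     Channel13, Channel14, Channel15
};

// The factory is the single place that validates; no per-call object is
// created, so there is no construction overhead at the call sites.
inline Channel toChannel(unsigned value)
{
    if (value > 15)
        throw std::out_of_range("channel out of range");
    return static_cast<Channel>(value); // safe: value names a declared enumerator
}
```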
Without knowing how this code is going to be used, it's hard to say if it's overkill or not or pleasant to use. Try it out yourself - write a few test cases using both approaches of a char and a typesafe class - and see which you like. If you get sick of it after writing a few test cases, then it's probably best avoided, but if you find yourself liking the approach, then it might be a keeper.
If this is an API that's going to be used by many, then perhaps opening it up to some review might give you valuable feedback, since they presumably know the API domain quite well.
In my opinion, what you are proposing is not a big overhead, but personally I prefer to save the typing: document that anything outside of 0..15 is undefined, and use an assert() in the function to catch errors in debug builds. I don't think the added complexity offers much more protection for programmers already used to C++, a language whose spec contains a lot of undefined behaviour.
You have to make a choice. There is no silver bullet here.
Performance
From the performance perspective, the overhead isn't going to be much if at all. (unless you've got to counting cpu cycles) So most likely this shouldn't be the determining factor.
Simplicity/ease of use etc
Make the API simple and easy to understand/learn.
You should know/decide whether numbers/enums/class would be easier for the api user
Maintainability
If you are very sure the channel type is going to be an integer for the foreseeable future, I would go without the abstraction (consider using enums).
If you have a lot of use cases for bounded values, consider using the templates (Jerry).
If you think Channel can potentially have methods, make it a class right now.
Coding effort
It's a one-time thing, so always think maintenance.
The channel example is a tough one:
At first it looks like a simple limited-range integer type, like you find in Pascal and Ada. C++ gives you no way to say this, but an enum is good enough.
If you look closer, could it be one of those design decisions that are likely to change? Could you start referring to "channel" by frequency? By call letters (WGBH, come in)? By network?
A lot depends on your plans. What's the main goal of the API? What's the cost model? Will channels be created very frequently (I suspect not)?
To get a slightly different perspective, let's look at the cost of screwing up:
You expose the rep as int. Clients write a lot of code, the interface is either respected or your library halts with an assertion failure. Creating channels is dirt cheap. But if you need to change the way you're doing things, you lose "backward bug-compatibility" and annoy authors of sloppy clients.
You keep it abstract. Everybody has to use the abstraction (not so bad), and everybody is future-proofed against changes in the API. Maintaining backwards compatibility is a piece of cake. But creating channels is more costly, and worse, the API has to state carefully when it is safe to destroy a channel, and who is responsible for the decision and the destruction. The worst-case scenario is that creating/destroying channels leads to a big memory leak or other performance failure, in which case you fall back to the enum.
I'm a sloppy programmer, and if it were for my own work, I'd go with the enum and eat the cost if the design decision changes down the line. But if this API were to go out to a lot of other programmers as clients, I'd use the abstraction.
Evidently I'm a moral relativist.
An integer with values only ever between 0 and 15 is an unsigned 4-bit integer (or half-byte, a nibble). I imagine that if this channel-switching logic were implemented in hardware, the channel number might be represented as exactly that: a 4-bit register.
If C++ had that as a type you would be done right there:
void Func(unsigned nibble channel)
{
// do something
}
Alas, it doesn't. You could relax the API specification to express that the channel number is given as an unsigned char, with the actual channel being computed using a modulo-16 operation:
void Func(unsigned char channel)
{
channel &= 0x0f; // truncate
// do something
}
Or, use a bitfield:
#include <iostream>
struct Channel {
// 4-bit unsigned field
unsigned int n : 4;
};
void Func(Channel channel)
{
// do something with channel.n
}
int main()
{
Channel channel = {9};
std::cout << "channel is " << channel.n << '\n';
Func (channel);
}
The latter might be less efficient.
I vote for your first approach, because it's simpler and easier to understand, maintain, and extend, and because it is more likely to map directly to other languages should your API have to be reimplemented/translated/ported/etc.
This is abstraction, my friend! It's always neater to work with objects.