I have a simple struct that looks like this:
struct Object
{
int x_;
double y_;
};
I am trying to manipulate the raw data of an Object; this is what I've done:
int main()
{
Object my_object;
unsigned char* raw_data = reinterpret_cast<unsigned char*>(&my_object);
int x = 10;
memcpy(raw_data, &x, sizeof(x));
raw_data += sizeof(x);
double y = 20.1;
memcpy(raw_data, &y, sizeof(y));
Object* my_object_ptr = reinterpret_cast<Object *>(raw_data);
std::cout << (*my_object_ptr).x_ << std::endl; //prints 20 (expected 10)
std::cout << (*my_object_ptr).y_ << std::endl; //prints Rubbish (expected 20.1)
}
I was expecting that the above code would work.
What is the real problem? Is this even possible?
You need to use the offsetof macro. There were a few other problems too; most importantly, you modified the raw_data pointer and then cast the modified value back to an Object* pointer, resulting in undefined behavior. I chose to remove the raw_data modification (the alternative would have been to not cast it back, but to just inspect my_object directly). Here's the fixed code, with explanations in comments:
#include <iostream>
#include <cstring> // for memcpy
#include <cstddef> // for offsetof macro
struct Object
{
int x_;
double y_;
};
int main()
{
Object my_object;
unsigned char* raw_data = reinterpret_cast<unsigned char*>(&my_object);
int x = 10;
// 1st memcpy fixed to calculate offset of x_ (even though it is probably 0)
memcpy(raw_data + offsetof(Object, x_), &x, sizeof(x));
//raw_data += offsetof(Object, y_); // if used, add offset of y_ instead of sizeof x
double y = 20.1;
// 2nd memcpy fixed to calculate offset of y_ (offset could be 4 or 8, depends on packing, sizeof int, etc)
memcpy(raw_data + offsetof(Object, y_), &y, sizeof(y));
// cast back to Object* pointer
Object* my_object_ptr = reinterpret_cast<Object *>(raw_data);
std::cout << my_object_ptr->x_ << std::endl; //prints 10
std::cout << my_object_ptr->y_ << std::endl; //prints 20.1
}
This is probably a structure padding issue. If you had double y_ as the first member, you'd probably have seen what you expected. The compiler will pad the structure with extra bytes to make the alignment correct in case the struct is used in an array. Try
#pragma pack(4)
before your struct definition.
The #pragma pack reference for Visual Studio: http://msdn.microsoft.com/en-us/library/2e70t5y1.aspx Your struct is packed to 8 bytes by default, so there's a 4 byte pad between x_ and y_.
Read http://www.catb.org/esr/structure-packing/ to really understand what's going on.
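To see the padding directly, here's a small standalone sketch (the exact numbers depend on your compiler and ABI; the comments assume a typical 64-bit target):
#include <cstddef>
#include <iostream>

struct Object
{
    int x_;
    double y_;
};

#pragma pack(4)
struct PackedObject
{
    int x_;
    double y_;
};
#pragma pack()

int main()
{
    // Default alignment: y_ typically starts at offset 8, leaving a 4-byte pad after x_.
    std::cout << sizeof(Object) << " " << offsetof(Object, y_) << "\n";             // e.g. 16 8
    // With 4-byte packing the pad disappears and y_ follows x_ immediately.
    std::cout << sizeof(PackedObject) << " " << offsetof(PackedObject, y_) << "\n"; // e.g. 12 4
}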
I'm looking for a way to overload operator[] (within a broader SIMD class) to facilitate reading and writing individual elements within a SIMD word (e.g. __m512i). A couple constraints:
Compliant with C++11 (or later)
Compatible with additional intrinsics based code
Not OpenCL/SYCL (which I could, but I can't *sigh*)
Mostly portable across g++, icpc, clang++
Preferably applicable to other SIMD beyond Intel (ARM, IBM, etc...)
(edit) Performance isn't really an issue (not generally used in places where performance matters)
(This rules out things like type punning through pointer casting, and GCC vector types.)
Based heavily on Scott Meyers' "More Effective C++" (Item 30) and other code, I've come up with the following minimal code that seems "right", that seems to work, but also seems overcomplicated. (The "proxy" approach is meant to deal with the left/right hand operator[] usage, and the "memcpy" is meant to deal with the type punning/C++ standard issue.)
I'm wondering if someone has a better solution (and can explain it so I learn something ;^))
#include <iostream>
#include <cstring>
#include "immintrin.h"
using T = __m256i; // SIMD type
using Te = unsigned int; // SIMD element type
class SIMD {
class SIMDProxy;
public :
const SIMDProxy operator[](int index) const {
std::cout << "SIMD::operator[] const" << std::endl;
return SIMDProxy(const_cast<SIMD&>(*this), index);
}
SIMDProxy operator[](int index){
std::cout << "SIMD::operator[]" << std::endl;
return SIMDProxy(*this, index);
}
Te get(int index) {
std::cout << "SIMD::get" << std::endl;
alignas(T) Te tmp[8];
std::memcpy(tmp, &value, sizeof(T)); // _mm256_store_si256(reinterpret_cast<__m256i *>(tmp), c.value);
return tmp[index];
}
void set(int index, Te x) {
std::cout << "SIMD::set" << std::endl;
alignas(T) Te tmp[8];
std::memcpy(tmp, &value, sizeof(T)); // _mm256_store_si256(reinterpret_cast<__m256i *>(tmp), c.value);
tmp[index] = x;
std::memcpy(&value, tmp, sizeof(T)); // c.value = _mm256_load_si256(reinterpret_cast<__m256i const *>(tmp));
}
void splat(Te x) {
alignas(T) Te tmp[8];
std::memcpy(tmp, &value, sizeof(T));
for (int i=0; i<8; i++) tmp[i] = x;
std::memcpy(&value, tmp, sizeof(T));
}
void print() {
alignas(T) Te tmp[8];
std::memcpy(tmp, &value, sizeof(T));
for (int i=0; i<8; i++) std::cout << tmp[i] << " ";
std::cout << std::endl;
}
protected :
private :
T value;
class SIMDProxy {
public :
SIMDProxy(SIMD & c_, int index_) : c(c_), index(index_) {};
// lvalue access
SIMDProxy& operator=(const SIMDProxy& rhs) {
std::cout << "SIMDProxy::=SIMDProxy" << std::endl;
c.set(index, rhs.c.get(rhs.index));
return *this;
}
SIMDProxy& operator=(Te x) {
std::cout << "SIMDProxy::=T" << std::endl;
c.set(index,x);
return *this;
}
// rvalue access
operator Te() const {
std::cout << "SIMDProxy::()" << std::endl;
return c.get(index);
}
private:
SIMD& c; // SIMD this proxy refers to
int index; // index of element we want
};
friend class SIMDProxy; // give SIMDProxy access into SIMD
};
/** a little main to exercise things **/
int
main(int argc, char *argv[])
{
SIMD x, y;
Te a = 3;
x.splat(1);
x.print();
y.splat(2);
y.print();
x[0] = a;
x.print();
y[1] = a;
y.print();
x[1] = y[1];
x.print();
}
Your code is very inefficient. Normally these SIMD types are not present anywhere in memory; they are hardware registers, they don't have addresses, and you can't pass them to memcpy(). Compilers pretend very hard that they're normal variables, which is why your code compiles and probably works, but it's slow: you're doing round trips from registers to memory and back all the time.
Here’s how I would do that, assuming AVX2 and integer lanes.
#include <array>
#include <cassert>
#include <cstddef>
#include <immintrin.h>
class SimdVector
{
__m256i val;
alignas( 64 ) static const std::array<int, 8 + 7> s_blendMaskSource;
public:
int operator[]( size_t lane ) const
{
assert( lane < 8 );
// Move lane index into lowest lane of vector register
const __m128i shuff = _mm_cvtsi32_si128( (int)lane );
// Permute the vector so the lane we need is moved to the lowest lane
// _mm256_castsi128_si256 says "the upper 128 bits of the result are undefined",
// and we don't care indeed.
const __m256i tmp = _mm256_permutevar8x32_epi32( val, _mm256_castsi128_si256( shuff ) );
// Return the lowest lane of the result
return _mm_cvtsi128_si32( _mm256_castsi256_si128( tmp ) );
}
void setLane( size_t lane, int value )
{
assert( lane < 8 );
// Load the blending mask
const int* const maskLoadPointer = s_blendMaskSource.data() + 7 - lane;
const __m256i mask = _mm256_loadu_si256( ( const __m256i* )maskLoadPointer );
// Broadcast the source value into all lanes.
// The compiler will do equivalent of _mm_cvtsi32_si128 + _mm256_broadcastd_epi32
const __m256i broadcasted = _mm256_set1_epi32( value );
// Use vector blending instruction to set the desired lane
val = _mm256_blendv_epi8( val, broadcasted, mask );
}
template<size_t lane>
int getLane() const
{
static_assert( lane < 8, "lane index out of range" );
// That thing is not an instruction;
// compilers emit different ones based on the index
return _mm256_extract_epi32( val, (int)lane );
}
template<size_t lane>
void setLane( int value )
{
static_assert( lane < 8, "lane index out of range" );
val = _mm256_insert_epi32( val, value, (int)lane );
}
};
// Align by 64 bytes to guarantee it's contained within a cache line
alignas( 64 ) const std::array<int, 8 + 7> SimdVector::s_blendMaskSource
{
0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0
};
For ARM it’s different. If lane index is known at compile time, see vgetq_lane_s32 and vsetq_lane_s32 intrinsics.
For setting lanes on ARM you can use the same broadcast + blend trick. Broadcast is vdupq_n_s32. An approximate equivalent of the vector blend is vbslq_s32; it handles every bit independently, but for this use case it's equally suitable because -1 has all 32 bits set.
For extracting with a runtime index, either write a switch or store the complete vector into memory; I'm not sure which of these two is more efficient.
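For a runtime lane index on ARM, a minimal sketch of that broadcast + bit-select approach might look like this (the mask table and the setLaneRuntime name are my own; 4 x int32 NEON lanes assumed):
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Set one lane of an int32x4_t at a runtime index using broadcast + bit-select.
inline int32x4_t setLaneRuntime( int32x4_t v, size_t lane, int32_t value )
{
    // One all-ones mask entry per lane
    alignas( 16 ) static const uint32_t maskTable[ 4 ][ 4 ] = {
        { ~0u, 0, 0, 0 },
        { 0, ~0u, 0, 0 },
        { 0, 0, ~0u, 0 },
        { 0, 0, 0, ~0u },
    };
    const uint32x4_t mask = vld1q_u32( maskTable[ lane ] ); // load the blend mask
    const int32x4_t broadcasted = vdupq_n_s32( value );     // splat the value into all lanes
    // vbslq_s32 picks bits from the 2nd argument where the mask bits are 1, from the 3rd otherwise
    return vbslq_s32( mask, broadcasted, v );
}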
Of the original approaches (memcpy, intrinsic load/store) and the additional suggestions (user-defined union punning, user-defined vector type), it seems like the intrinsic approach may have a small advantage. This is based on some quick examples I attempted to code up in Godbolt (https://godbolt.org/z/5zdbKe).
The "best" for writing to an element looks something like this.
__m256i foo2(__m256i x, unsigned int a, int index)
{
alignas(__m256i) unsigned int tmp[8];
_mm256_store_si256(reinterpret_cast<__m256i *>(tmp), x);
tmp[index] = a;
__m256i z = _mm256_load_si256(reinterpret_cast<__m256i const *>(tmp));
return z;
}
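For symmetry, a matching read with the same store-to-an-aligned-buffer approach might look like this (foo3 is just an illustrative name):
unsigned int foo3(__m256i x, int index)
{
    alignas(__m256i) unsigned int tmp[8];
    _mm256_store_si256(reinterpret_cast<__m256i *>(tmp), x); // spill the vector to an aligned buffer
    return tmp[index];                                       // index it like a plain array
}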
If you only care about g++/clang++/icc compatibility, you can just use the __attribute__ which these compilers use internally to define their intrinsic instructions:
typedef int32_t int32x16_t __attribute__((vector_size(16*sizeof(int32_t)))) __attribute__((aligned(16*sizeof(int32_t))));
When it makes sense (and is possible on the given architecture), variables will be stored in vector registers. Also, the compilers provide a read/writeable operator[] for this typedef (which should get optimized, if the index is known at compile-time).
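For illustration, a small sketch of that operator[] using the typedef above (whether values actually live in vector registers depends on the target and optimization level):
#include <cstdint>
#include <iostream>

typedef int32_t int32x16_t __attribute__((vector_size(16*sizeof(int32_t)))) __attribute__((aligned(16*sizeof(int32_t))));

int main()
{
    int32x16_t v = {};          // zero-initialize all 16 lanes
    v[3] = 42;                  // element access reads and writes like an array
    int32x16_t w = v + v;       // arithmetic is applied lane-wise
    std::cout << w[3] << "\n";  // prints 84
}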
Is the following the best way to pack a float's bits into a uint32? This might be a fast and easy yes, but I want to make sure there's no better way, or that exchanging the value between processes doesn't introduce a weird wrinkle.
"Best" in my case, is that it won't ever break on a compliant C++ compiler (given the static assert), can be packed and unpacked between two processes on the same computer, and is as fast as copying a uint32 into another uint32.
Process A:
static_assert(sizeof(float) == sizeof(uint32) && alignof(float) == alignof(uint32), "no");
...
float f = 0.5f;
uint32 buffer[128];
memcpy(buffer + 41, &f, sizeof(uint32)); // packing
Process B:
uint32 * buffer = thisUint32Is_ReadFromProcessA(); // reads "buffer" from process A
...
memcpy(&f, buffer + 41, sizeof(uint32)); // unpacking
assert(f == 0.5f);
Yes, this is the standard way to do type punning. Cppreference's page on memcpy even includes an example showing how you can use it to reinterpret a double as an int64_t:
#include <iostream>
#include <cstdint>
#include <cstring>
int main()
{
// simple usage
char source[] = "once upon a midnight dreary...", dest[4];
std::memcpy(dest, source, sizeof dest);
for (char c : dest)
std::cout << c << '\n';
// reinterpreting
double d = 0.1;
// std::int64_t n = *reinterpret_cast<std::int64_t*>(&d); // aliasing violation
std::int64_t n;
std::memcpy(&n, &d, sizeof d); // OK
std::cout << std::hexfloat << d << " is " << std::hex << n
<< " as an std::int64_t\n";
}
Output:
o
n
c
e
0x1.999999999999ap-4 is 3fb999999999999a as an std::int64_t
As long as the asserts pass (you are writing and reading the correct number of bytes), the operation is safe. You can't pack a 64-bit object into a 32-bit object, but you can pack one 32-bit object into another 32-bit object, as long as they are trivially copyable.
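If you want the compiler to enforce those preconditions, here's a minimal sketch (pack/unpack are hypothetical helper names, and uint32 is spelled std::uint32_t):
#include <cstdint>
#include <cstring>
#include <type_traits>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sizes must match");
static_assert(std::is_trivially_copyable<float>::value &&
              std::is_trivially_copyable<std::uint32_t>::value,
              "memcpy punning requires trivially copyable types");

std::uint32_t pack(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u); // copy the object representation bit-for-bit
    return u;
}

float unpack(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}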
Or this:
union TheUnion {
uint32 theInt;
float theFloat;
};
TheUnion converter;
converter.theFloat = myFloatValue;
uint32 myIntRep = converter.theInt;
I don't know if this is better, but it's a different way to look at it.
I just discovered some dodgy problems when I was interleaving some floats. I've simplified the issue down and tried some tests:
#include <iostream>
#include <vector>
std::vector<float> v; // global instance
union{ // shared memory space
float f; // to store data in interleaved float array
unsigned int argb; // int color value
}color; // global instance
int main(){
std::cout<<std::hex; // print hexadecimal
color.argb=0xff810000; // NEED A==ff AND R>80 (idk why)
std::cout<<color.argb<<std::endl; // NEED TO PRINT (i really dk why)
v.insert(v.end(),{color.f,0.0f,0.0f}); // color, x, y... (need the x, y too. heh..)
color.f=v[0]; // read float back (so we can see argb data)
std::cout<<color.argb<<std::endl; // ffc10000 (WRONG!)
}
the program prints
ff810000
ffc10000
If someone can show me I'm just being dumb somewhere, that'd be great.
Update: turned off optimizations.
#include <iostream>
union FLOATINT{float f; unsigned int i;};
int main(){
std::cout<<std::hex; // print in hex
FLOATINT a;
a.i = 0xff810000; // store int
std::cout<<a.i<<std::endl; // ff810000
FLOATINT b;
b.f = a.f; // store float
std::cout<<b.i<<std::endl; // ffc10000
}
or
#include <iostream>
int main(){
std::cout<<std::hex; // print in hex
unsigned int i = 0xff810000; // store int
std::cout<<i<<std::endl; // ff810000
float f = *(float*)&i; // store float from int memory
unsigned int i2 = *(unsigned int*)&f; // store int from float memory
std::cout<<i2<<std::endl; // ffc10000
}
Solution:
#include <iostream>
#include <cstring> // for memcpy
int main(){
std::cout<<std::hex;
unsigned int i=0xff810000;
std::cout<<i<<std::endl; // ff810000
float f; memcpy(&f, &i, 4);
unsigned int i2; memcpy(&i2, &f, 4);
std::cout<<i2<<std::endl; // ff810000
}
The behavior you're seeing is well defined IEEE floating point math.
The value you're storing in argb, when interpreted as a float will be a SNaN (Signaling NaN). When this SNaN value is loaded into a floating point register, it will be converted to a QNaN (Quiet NaN) by setting the most significant fraction bit to a 1 (and will raise an exception if floating point exceptions are unmasked).
This load will change your value from ff810000 to ffc10000.
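To make the bit-level change concrete, here's a tiny sketch showing that the difference between the two values is exactly the quiet-NaN bit of an IEEE-754 binary32:
#include <cstdint>
#include <iostream>

int main()
{
    const std::uint32_t before   = 0xff810000u; // signaling NaN: exponent all ones, top fraction bit clear
    const std::uint32_t quietBit = 0x00400000u; // most significant fraction bit of binary32

    // Setting the quiet bit reproduces exactly the value observed after the load.
    std::cout << std::hex << (before | quietBit) << std::endl; // ffc10000
}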
Writing to the int and then reading from the float in the union causes UB. If you want to create a vector of mixed value types, make a struct to hold them. Also, don't use unsigned int when you need exactly 32 bits. Use uint32_t.
#include <iostream>
#include <vector>
struct gldata {
uint32_t argb;
float x;
float y;
};
std::vector<gldata> v;
int main() {
std::cout << std::hex; // print hexadecimal
v.emplace_back(gldata{0xff810000, 0.0f, 0.0f});
std::cout << v[0].argb << "\n"; // 0xff810000
}
I have an application where I need to save as much of memory as possible. I need to store a large amount of data that can take exactly three possible values. So, I have been trying to use a 2 bit sized type.
One possibility is using bit fields. I could do
struct myType {
uint8_t twoBits : 2;
};
This is a suggestion from this thread.
However, everywhere where I have used int variables prior to this, I would need to change their usage by appending a .twoBits. I checked if I can create a bit field outside of a struct, such as
uint8_t twoBits : 2;
but this thread says it is not possible. However, that thread is specific to C, so I am not sure if it applies to C++.
Is there a clean way I can define a 2-bit type, so that by simply replacing int with my type, I can run the program correctly? Or is using bit fields the only possible way?
The CPU, and thus the memory, the bus, and the compiler too, works only in bytes or groups of bytes. There's no way to store a 2-bit type without also storing the remaining 6 bits.
What you can do is define a struct that only uses some bits. But be aware that it will not save memory.
You can pack several x-bit types in a struct, as you already know. Or you can do bit operations to pack/unpack them into an integer type, as in the sketch below.
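A minimal sketch of that second option, packing four 2-bit values into one byte with shifts and masks (set2/get2 are just illustrative names):
#include <cstdint>
#include <iostream>

// Write a 2-bit value (0..3) at position index (0..3) within a byte.
std::uint8_t set2(std::uint8_t byte, unsigned index, unsigned value)
{
    const unsigned shift = index * 2;
    byte = static_cast<std::uint8_t>(byte & ~(0x3u << shift));          // clear the two target bits
    byte = static_cast<std::uint8_t>(byte | ((value & 0x3u) << shift)); // write the new value
    return byte;
}

// Read the 2-bit value at position index.
unsigned get2(std::uint8_t byte, unsigned index)
{
    return (byte >> (index * 2)) & 0x3u;
}

int main()
{
    std::uint8_t b = 0;
    b = set2(b, 0, 1);
    b = set2(b, 3, 2);
    std::cout << get2(b, 0) << " " << get2(b, 3) << "\n"; // 1 2
}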
Is there a clean way I can define a 2-bit type, so that by simply
replacing int with my type, I can run the program correctly? Or is
using bit fields the only possible way?
You can try to make the struct as transparent as possible by providing implicit conversion operators and constructors:
#include <cstdint>
#include <iostream>
template <std::size_t N, typename T = unsigned>
struct bit_field {
T rep : N;
operator T() { return rep; }
bit_field(T i) : rep{ i } { }
bit_field() = default;
};
using myType = bit_field<2, std::uint8_t>;
int main() {
myType mt;
mt = 3;
std::cout << mt << "\n";
}
So objects of type myType somewhat behave like real 2-bit unsigned integers, despite occupying more than 2 bits.
Of course, the residual bits are unused, but as single bits are not addressable on most systems, this is the best way to go.
I'm not convinced that you will save anything with your existing structure, as the surrounding structure still gets rounded up to a whole number of bytes.
You can write the following to squeeze 4 2-bit counters into 1 byte, but as you say, you have to name them myInst.f0:
struct MyStruct
{
uint8_t f0:2,
f1:2,
f2:2,
f3:2;
} myInst;
In C and C++98, you can declare this anonymous, but this usage is deprecated. You can then access the 4 values directly by name:
struct
{ // deprecated!
uint8_t f0:2,
f1:2,
f2:2,
f3:2;
};
You could declare some sort of template that wraps a single instance with an operator int and an operator =(int), and then define a union to put the 4 instances at the same location, but again, anonymous unions are deprecated. However, you could then declare references to your 4 values, but then you are paying for the references, which are bigger than the bytes you were trying to save!
template <class Size,int offset,int bits>
struct Bitz
{
Size ignore : offset,
value : bits;
operator Size()const { return value; }
Size operator = (Size val) { return (value = val); }
};
template <class Size,int bits>
struct Bitz0
{ // I know this can be done better
Size value : bits;
operator Size()const { return value; }
Size operator = (Size val) { return (value = val); }
};
static union
{ // Still deprecated!
Bitz0<char, 2> F0;
Bitz<char, 2, 2> F1;
Bitz<char, 4, 2> F2;
Bitz<char, 6, 2> F3;
};
union
{
Bitz0<char, 2> F0;
Bitz<char, 2, 2> F1;
Bitz<char, 4, 2> F2;
Bitz<char, 6, 2> F3;
} bitz;
Bitz0<char, 2>& F0 = bitz.F0; /// etc...
Alternatively, you could simply declare macros to replace the dotted name with a simple name (how 1970s):
#define myF0 myInst.f0
Note that you can't pass bit-fields by reference or pointer, as they don't have a byte address; you can only access them by value and assignment.
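A quick illustration of that last point (a standalone sketch; the commented-out lines do not compile):
#include <cstdint>

struct MyStruct
{
    std::uint8_t f0:2,
                 f1:2,
                 f2:2,
                 f3:2;
};

int main()
{
    MyStruct myInst{};
    // std::uint8_t* p = &myInst.f0; // error: cannot take the address of a bit-field
    // std::uint8_t& r = myInst.f0;  // error: cannot bind a non-const reference to a bit-field
    myInst.f0 = 3;                   // fine: write by assignment
    std::uint8_t copy = myInst.f0;   // fine: read by value
    (void)copy;
}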
A very minimal example of a bit array with a proxy class that looks (for the most part) like you were dealing with an array of very small integers.
#include <cstdint>
#include <iostream>
#include <vector>
class proxy
{
uint8_t & byte;
unsigned int shift;
public:
proxy(uint8_t & byte,
unsigned int shift):
byte(byte),
shift(shift)
{
}
proxy(const proxy & src):
byte(src.byte),
shift(src.shift)
{
}
proxy & operator=(const proxy &) = delete;
proxy & operator=(unsigned int val)
{
if (val <=3)
{
uint8_t wipe = 3 << shift;
byte &= ~wipe;
byte |= val << shift;
}
// might want to throw std::out_of_range here
return *this;
}
operator int() const
{
return (byte >> shift) &0x03;
}
};
Proxy holds a reference to a byte and knows how to extract two specific bits and look like an int to anyone who uses it.
If we wrap an array of bits packed into bytes with a class that returns this proxy object wrapped around the appropriate byte, we now have something that looks a lot like an array of very small ints.
class bitarray
{
size_t size;
std::vector<uint8_t> data;
public:
bitarray(size_t size):
size(size),
data((size + 3) / 4)
{
}
proxy operator[](size_t index)
{
return proxy(data[index/4], (index % 4) * 2);
}
};
If you want to extend this and go the distance, Writing your own STL Container should help you make a fully armed and operational bit-packed array.
There's room for abuse here. The caller can hold onto a proxy and get up to whatever manner of evil this allows.
Use of this primitive example:
int main()
{
bitarray arr(10);
arr[0] = 1;
arr[1] = 2;
arr[2] = 3;
arr[3] = 1;
arr[4] = 2;
arr[5] = 3;
arr[6] = 1;
arr[7] = 2;
arr[8] = 3;
arr[9] = 1;
std::cout << arr[0] << std::endl;
std::cout << arr[1] << std::endl;
std::cout << arr[2] << std::endl;
std::cout << arr[3] << std::endl;
std::cout << arr[4] << std::endl;
std::cout << arr[5] << std::endl;
std::cout << arr[6] << std::endl;
std::cout << arr[7] << std::endl;
std::cout << arr[8] << std::endl;
std::cout << arr[9] << std::endl;
}
Simply build on top of std::bitset, something like:
#include<bitset>
#include<cstdlib>
#include<exception>
#include<iostream>
using namespace std;
template<int N>
class mydoublebitset
{
public:
uint_least8_t operator[](size_t index)
{
return 2 * b[index * 2 + 1] + b[index * 2 ];
}
void set(size_t index, uint_least8_t store)
{
switch (store)
{
case 3:
b[index * 2] = 1;
b[index * 2 + 1] = 1;
break;
case 2:
b[index * 2] = 0;
b[index * 2 + 1] = 1;
break;
case 1:
b[index * 2] = 1;
b[index * 2 + 1] = 0;
break;
case 0:
b[index * 2] = 0;
b[index * 2 + 1] = 0;
break;
default:
throw exception();
}
}
private:
bitset<N * 2> b;
};
int main()
{
mydoublebitset<12> mydata;
mydata.set(0, 0);
mydata.set(1, 2);
mydata.set(2, 2);
cout << (unsigned int)mydata[0] << (unsigned int)mydata[1] << (unsigned int)mydata[2] << endl;
system("pause");
return 0;
}
Basically, use a bitset with twice the size and index it accordingly. It's simpler and as memory-efficient as you require.
This was an interview question:
Say there is a class having only an int member. You do not know how many bytes the int will occupy, and you cannot view the class implementation (say it's an API), but you can create an object of it. How would you find the size needed for the int without using sizeof?
He wouldn't accept using bitset, either.
Can you please suggest the most efficient way to find this out?
The following program demonstrates a valid technique to compute the size of an object.
#include <iostream>
struct Foo
{
int f;
};
int main()
{
// Create an object of the class.
Foo foo;
// Create a pointer to it.
Foo* p1 = &foo;
// Create another pointer, offset by 1 object from p1
// It is legal to compute (p1+1) but it is not legal
// to dereference (p1+1)
Foo* p2 = p1+1;
// Cast both pointers to char*.
char* cp1 = reinterpret_cast<char*>(p1);
char* cp2 = reinterpret_cast<char*>(p2);
// Compute the size of the object.
size_t size = (cp2-cp1);
std::cout << "Size of Foo: " << size << std::endl;
}
Using pointer algebra:
#include <iostream>
class A
{
int a;
};
int main() {
A a1;
A * n1 = &a1;
A * n2 = n1+1;
std::cout << int((char *)n2 - (char *)n1) << std::endl;
return 0;
}
Yet another alternative without using pointers. You can use it if in the next interview they also forbid pointers. Your comment "The interviewer was leading me to think on lines of overflow and underflow" might also be pointing at this method or similar.
#include <iostream>
int main() {
unsigned int x = 0, numOfBits = 0;
for(x--; x; x /= 2) numOfBits++;
std::cout << "number of bits in an int is: " << numOfBits;
return 0;
}
It gets the maximum value of an unsigned int (by decrementing zero, which wraps around for unsigned types), then repeatedly divides by 2 until it reaches zero, counting the bits. To get the number of bytes, divide by CHAR_BIT.
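For example, a small variation of the same loop that reports bytes instead of bits (CHAR_BIT comes from <climits>):
#include <climits>
#include <iostream>

int main() {
    unsigned int x = 0, numOfBits = 0;
    for(x--; x; x /= 2) numOfBits++; // x-- wraps around to UINT_MAX, then halve until zero
    std::cout << "number of bytes in an unsigned int is: " << numOfBits / CHAR_BIT;
    return 0;
}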
Pointer arithmetic can be used without actually creating any objects:
class c {
int member;
};
c *ptr = 0;
++ptr;
std::size_t size = reinterpret_cast<std::size_t>(ptr);
Alternatively:
std::size_t size = reinterpret_cast<std::size_t>( static_cast<c*>(0) + 1 );