While debugging my code I noticed that something strange was going on, so I added more lines and got even more confused:
#include <iostream>
#include <memory>
struct Node
{
    size_t size = 0;
};

class MallocMetadata {
public:
    size_t size; /** Effective allocation - requested size **/
    bool is_free;
    MallocMetadata *next;
    MallocMetadata *prev;
    MallocMetadata *next_free;
    MallocMetadata *prev_free;
};
int main()
{
    size_t size = 0;
    auto node = std::make_shared<Node>();
    int tmp_res = node->size - size - sizeof(MallocMetadata);
    bool test = (node->size - size - sizeof(MallocMetadata)) < 128;
    bool test1 = tmp_res < 128;
    std::cout << tmp_res << "\n";
    std::cout << test << "\n";
    std::cout << test1 << "\n";
}
After running these 3 lines I saw:
tmp_res=-48
test = false
test1 = true
How is this even possible? Why is test false? -48 is smaller than 128.
It looks like the expression node->size - size - sizeof(MallocMetadata) is calculated in an unsigned integer type.
When an unsigned calculation would go negative, the maximum value of the type plus one is added, so the result wraps around.
Therefore the value ends up being a huge number (well over 128), making the expression (node->size - size - sizeof(MallocMetadata)) < 128 false.
On the other hand, int tmp_res = node->size - size - sizeof(MallocMetadata); converts that huge value to int. int is signed, so after the conversion it can give a different result than the expression above, which never converts to int.
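A minimal sketch of the wraparound, assuming a 64-bit size_t:

#include <iostream>

int main()
{
    size_t a = 0, b = 48; // 48 = sizeof(MallocMetadata) on a typical 64-bit build
    std::cout << (a - b) << "\n";         // 18446744073709551568, i.e. 2^64 - 48
    std::cout << ((a - b) < 128) << "\n"; // 0 (false): the wrapped value is huge
    std::cout << int(a - b) << "\n";      // -48 (modular since C++20; implementation-defined
                                          // before, but mainstream compilers wrap like this)
}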
I believe that whenever there is an unsigned value in an expression, the result tends to be unsigned as well:
size_t size = 0;
auto val = 1 + 100 + (-100) + size;
std::cout << typeid(val).name();
'val' will be a size_t as well. So in your case, you're trying to store a negative value in a size_t, which causes it to wrap around.
You can explicitly cast the expression to a signed integer, and that should be enough if I'm not mistaken. Like so:
bool test=int(node->size - size - sizeof(MallocMetadata)) < 128;
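One caveat (my addition, not part of the original answer): converting an out-of-range unsigned value to int is only guaranteed to be modular since C++20; before that it is implementation-defined, although mainstream compilers wrap as expected. Casting the operands first keeps the whole subtraction signed:

bool test = (long long)(node->size) - (long long)size - (long long)sizeof(MallocMetadata) < 128;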
I was playing around with memcpy when I stumbled on a strange result: a memcpy called on the same block of memory right after a bool memcpy gives an unexpected result.
I created a simple test struct that has a bunch of variables of different types. I cast the struct to an unsigned char pointer, and then, using memcpy, I copy data from that pointer into separate variables. I tried playing around with the offset of the memcpy and moving the int memcpy before the bool one (changing the layout of the test struct so that the int would go before the bool too). Surprisingly, the shifting fixed the problem.
#include <cstring> // std::memcpy (the original snippet omitted its includes)

// Simple struct containing 3 floats
struct vector
{
    float x;
    float y;
    float z;
};

// My test struct
struct test2
{
    float a;
    vector b;
    bool c;
    int d;
};

int main()
{
    // I create my structure on the heap here and assign values
    test2* test2ptr = new test2();
    test2ptr->a = 50;
    test2ptr->b.x = 100;
    test2ptr->b.y = 101;
    test2ptr->b.z = 102;
    test2ptr->c = true;
    test2ptr->d = 5;
    // Then turn the struct into an array of single bytes
    unsigned char* data = (unsigned char*)test2ptr;
    // Variable for keeping track of the offset
    unsigned int offset = 0;
    // Variables that I want the memory copied into
    float a;
    vector b;
    bool c;
    int d;
    // I copy the memory here in the same order as it is defined in the struct
    std::memcpy(&a, data, sizeof(float));
    // Add the copied data size in bytes to the offset
    offset += sizeof(float);
    std::memcpy(&b, data + offset, sizeof(vector));
    offset += sizeof(vector);
    std::memcpy(&c, data + offset, sizeof(bool));
    offset += sizeof(bool);
    // It all works until here: the results are the same as the ones I assigned.
    // However, the int value becomes 83886080 instead of 5.
    // Moving this above the bool memcpy (and moving the member in the struct too) fixes the problem.
    std::memcpy(&d, data + offset, sizeof(int));
    offset += sizeof(int);
    return 0;
}
So I expected the value of d to be 5; however, it becomes 83886080, which I presume is just random uninitialized memory.
You are ignoring the padding of the data in your struct.
Take a look at the following simplified example:
#include <iostream>

struct X
{
    bool b;
    int i;
};

int main()
{
    X x;
    std::cout << "Address of b " << (void*)(&x.b) << std::endl;
    std::cout << "Address of i " << (void*)(&x.i) << std::endl;
}
On my PC, this results in:
Address of b 0x7ffce023f548
Address of i 0x7ffce023f54c
As you can see, the bool value in the struct takes 4 bytes here, even though it uses less for its content. The compiler must add padding bytes to the struct so that the CPU can access the data directly. If the data were packed linearly, exactly as written in your code, the compiler would have to generate extra instructions on every access to realign the data, which slows down your program a lot.
You can force the compiler to use that packed layout by adding pragma pack or something similar for your compiler. All the pragma things are compiler specific!
For your program, this means you have to use the actual address of each member for the memcpy, not the running sum of the sizes of the preceding members, as that ignores the padding bytes.
If I add a #pragma pack(1) before my program, the output is:
Address of b 0x7ffd16c79cfb
Address of i 0x7ffd16c79cfc
As you can see, there are no longer padding bytes between the bool and the int. But the code that accesses i later will be larger and slower! So avoid using #pragma pack if you can!
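One portable way to get the right member addresses without hard-coding offsets is offsetof; a minimal sketch, reusing test2 and the variables from the question:

#include <cstddef> // offsetof
#include <cstring> // std::memcpy

// Copy each member from its real offset; the padding no longer matters.
std::memcpy(&a, data + offsetof(test2, a), sizeof(a));
std::memcpy(&b, data + offsetof(test2, b), sizeof(b));
std::memcpy(&c, data + offsetof(test2, c), sizeof(c));
std::memcpy(&d, data + offsetof(test2, d), sizeof(d));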
You've got the answer you need, so I won't go into detail. I just made an extraction function with logging to make it easier to follow what's happening.
#include <cstdint> // std::uintptr_t
#include <cstring>
#include <iostream>
#include <memory>

// Simple struct containing 3 floats
struct vector {
    float x;
    float y;
    float z;
};

// My test struct
struct test2 {
    float a;
    vector b;
    bool c;
    int d;
};

template<typename T>
void extract(T& dest, unsigned char* data, size_t& offset) {
    std::uintptr_t dp = reinterpret_cast<std::uintptr_t>(data + offset);
    size_t align_overstep = dp % alignof(T);
    std::cout << "sizeof " << sizeof(T) << " alignof " << alignof(T) << " data "
              << dp << " mod " << align_overstep << "\n";
    if(align_overstep) {
        size_t missing = alignof(T) - align_overstep;
        std::cout << "misaligned - adding " << missing << " to align it again\n";
        offset += missing;
    }
    std::memcpy(&dest, data + offset, sizeof(dest));
    offset += sizeof(dest);
}

int main() {
    std::cout << std::boolalpha;
    // I create my structure on the heap here and assign values
    test2* test2ptr = new test2();
    test2ptr->a = 50;
    test2ptr->b.x = 100;
    test2ptr->b.y = 101;
    test2ptr->b.z = 102;
    test2ptr->c = true;
    test2ptr->d = 5;
    // Then turn the struct into an array of single bytes
    unsigned char* data = reinterpret_cast<unsigned char*>(test2ptr);
    // Variable for keeping track of the offset
    size_t offset = 0;
    // Variables that I want the memory copied into
    float a;
    vector b;
    bool c;
    int d;
    // I copy the memory here in the same order as it is defined in the struct
    extract(a, data, offset);
    std::cout << "a " << a << "\n";
    extract(b, data, offset);
    std::cout << "b.x " << b.x << "\n";
    std::cout << "b.y " << b.y << "\n";
    std::cout << "b.z " << b.z << "\n";
    extract(c, data, offset);
    std::cout << "c " << c << "\n";
    extract(d, data, offset);
    std::cout << "d " << d << "\n";
    std::cout << offset << "\n";
    delete test2ptr;
}
Possible output
sizeof 4 alignof 4 data 12840560 mod 0
a 50
sizeof 12 alignof 4 data 12840564 mod 0
b.x 100
b.y 101
b.z 102
sizeof 1 alignof 1 data 12840576 mod 0
c true
sizeof 4 alignof 4 data 12840577 mod 1
misaligned - adding 3 to align it again
d 5
24
There are apparently three padding bytes between the bool and the subsequent int. This is allowed by the standard for alignment reasons (accessing a 4-byte int that is not aligned on a 4-byte boundary may be slow, or may even crash on some systems).
So when you do offset += sizeof(bool), you are not incrementing enough. The int follows 4 bytes after the bool, not 1. The result is that the 5 is not the first byte you read but the last one: you are reading the three padding bytes plus the first byte of test2ptr->d into d. And it is no coincidence that 83886080 = 5 * 2^24 (the padding bytes were apparently all zeros).
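You can see those padding bytes directly by dumping the struct's raw memory; a small sketch (the dump helper is mine, and the byte values assume a little-endian machine with the layout above):

#include <cstddef>
#include <cstdio>

void dump(const unsigned char* p, std::size_t n) {
    for (std::size_t i = 0; i < n; i++)
        std::printf("%02x ", p[i]);
    std::printf("\n");
}

// dump(data, sizeof(test2)) would typically end with:
//   ... 01 00 00 00 05 00 00 00
// the bool (01), three zeroed padding bytes, then the little-endian int 5.
// Reading 4 bytes starting at the first padding byte gives 00 00 00 05,
// which as a little-endian int is 0x05000000 = 83886080.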
I recently ran into this weird C++ bug that I could not understand. Here's my code:
#include <bits/stdc++.h>
using namespace std;

typedef vector <int> vi;
typedef pair <int, int> ii;

#define ff first
#define ss second
#define pb push_back

const int N = 2050;
int n, k, sum = 0;
vector <ii> a;
vi pos;

int main (void) {
    cin >> n >> k;
    for (int i = 1; i < n+1; ++i) {
        int val;
        cin >> val;
        a.pb(ii(val, i));
    }
    cout << a.size()-1 << " " << k << " " << a.size()-k-1 << "\n";
}
When I tried it out with the test:
5 5
1 1 1 1 1
it returned:
4 5 4294967295
but when I changed the declaration from:
int n, k, sum = 0;
to:
long long n, k, sum = 0;
then the program returned the correct value, which was:
4 5 -1
I could not figure out why the program behaved like that, since -1 does not exceed the range of an int. Can anyone explain this to me? I really appreciate your kind help.
Thanks
Obviously, on your machine, size_t is a 32-bit integer, whereas long long is 64-bit. size_t is always an unsigned type, so you get:
cout << a.size() - 1
//      ^ unsigned    ^ the 1 is converted to unsigned, too
//      -> printed as a uint32_t (!)
a.size() - k - 1
// ^ a.size() is converted to long long here, as size_t is the smaller type!
// -> the overall expression is evaluated as int64_t (!)
You would not have seen any difference between the two values printed (both would have been 18446744073709551615) if size_t were 64-bit as well, because then the signed long long k (int64_t) would have been converted to unsigned (uint64_t) instead.
Be aware that static_cast<UnsignedType>(-1) always evaluates (according to C++ conversion rules) to std::numeric_limits<UnsignedType>::max()!
Side note about size_t: it is defined as an unsigned integral type large enough to hold the size of the largest object you can allocate on your system, so its size in bits is hardware dependent and, in the end, correlates with the width of the memory address bus (rounded up to a power of two).
vector::size() returns size_t (unsigned), so the expression a.size()-k-1 is evaluated in an unsigned type and you end up with wraparound.
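A minimal sketch of the usual fix (my suggestion, not from the answers above): do the arithmetic in a signed type by casting the size first:

#include <iostream>
#include <vector>

int main() {
    std::vector<int> a(5);
    long long k = 5;
    // Cast the unsigned size to a signed type wide enough, then subtract.
    long long diff = static_cast<long long>(a.size()) - k - 1;
    std::cout << diff << "\n"; // -1, whether size_t is 32-bit or 64-bit
}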
I am trying to use auto to infer the type.
for (auto i = remaining_group.size() - 1; i >= 0; --i) {
    std::cout << i;
}
I get a very large number like 18446744073709534800, which is not what I expected. When I change auto to int, I get the numbers I expected, between 0 and 39.
Is there any reason auto fails here?
remaining_group's type is std::vector<LidarPoint>, and LidarPoint is a struct like:
struct LidarPoint {
    float x;
    float y;
    float z;
    uint8_t field1;
    double field2;
    uint8_t field3;
    uint16_t field4;
};
When using auto, the type of i is std::vector<LidarPoint>::size_type, which is an unsigned integer type. That means the condition i >= 0 is always true, and when i wraps around below zero you get some very large numbers.
Unsigned integer arithmetic is always performed modulo 2^n, where n is the number of bits in that particular integer type. E.g. for unsigned int, adding one to UINT_MAX gives 0, and subtracting one from 0 gives UINT_MAX.
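A two-line demonstration of that modular behaviour (a minimal sketch):

#include <climits>
#include <iostream>

int main() {
    unsigned int u = UINT_MAX;
    std::cout << u + 1u << "\n"; // 0: wraps around modulo 2^n
    u = 0;
    std::cout << u - 1u << "\n"; // 4294967295 for a 32-bit unsigned int, i.e. UINT_MAX
}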
Simple reproduction of the problem:
#include <cstddef> // size_t
#include <iostream>

size_t size() { return 1; }

int main() {
    for (auto i = size() - 1; i >= 0; --i) {
        std::cout << i << std::endl;
    }
}
size() returns size_t, and the literal constant 1 is converted to size_t, so auto deduces size_t, which can never be less than zero; the result is an infinite loop and wraparound of i.
If you need a reverse index loop, use the "operator" -->.
When you write a normal index loop, you write it with 0, size, and <.
When you write a normal reverse index loop, things become a bit wonky: you need size - 1, >=, and 0, and you can't use an unsigned index, because an unsigned i is always >= 0, so your check i >= 0 always returns true and your loop could run forever.
With the fake operator "goes to", you can use 0, size, and > to write the reverse index loop, and it doesn't matter whether i is signed or unsigned:
for (auto i = a.size(); i --> 0; ) // or just i--, or i --> 1, i --> 10...
    std::cout << i << ' ';
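A complete sketch of that loop (the vector contents are just for illustration):

#include <iostream>
#include <vector>

int main() {
    std::vector<int> a{10, 20, 30};
    for (auto i = a.size(); i --> 0; ) // parsed as: while (i-- > 0)
        std::cout << i << ' ';         // prints: 2 1 0
}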
Input is a bitarray stored in contiguous memory with 1 bit of the bitarray per 1 bit of memory.
Output is an array of the indices of set bits of the bitarray.
Example:
bitarray: 0000 1111 0101 1010
setA: {4,5,6,7,9,11,12,14}
setB: {2,4,5,7,9,10,11,12}
Getting either set A or set B is fine.
The set is stored as an array of uint32_t, so each element of the set is an unsigned 32-bit integer in the array.
How to do this about 5 times faster on a single cpu core?
current code:
#include <iostream>
#include <vector>
#include <cstdint>
#include <cstdlib>
#include <time.h>
using namespace std;

// Forward declaration so bitarray2set can call it below.
inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base);

template <typename T>
uint32_t bitarray2set(T& v, uint32_t * ptr_set){
    uint32_t i;
    uint32_t base = 0;
    uint32_t * ptr_set_new = ptr_set;
    uint32_t size = v.capacity();
    for(i = 0; i < size; i++){
        find_set_bit(v[i], ptr_set_new, base);
        base += 8*sizeof(uint32_t);
    }
    return (ptr_set_new - ptr_set);
}

inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base){
    // Find the set bits in a uint32_t
    int k = base;
    while(n){
        if (n & 1){
            *(ptr_set) = k;
            ptr_set++;
        }
        n = n >> 1;
        k++;
    }
}

template <typename T>
void rand_vector(T& v){
    srand(time(NULL));
    int i;
    int size = v.capacity();
    for (i=0;i<size;i++){
        v[i] = rand();
    }
}

template <typename T>
void print_vector(T& v, int size_in = 0){
    int i;
    int size;
    if (size_in == 0){
        size = v.capacity();
    } else {
        size = size_in;
    }
    for (i=0;i<size;i++){
        cout << v[i] << ' ';
    }
    cout << endl;
}

int main(void){
    const int test_size = 6000;
    vector<uint32_t> vec(test_size);
    vector<uint32_t> set(test_size*sizeof(uint32_t)*8);
    rand_vector(vec);
    //for (int i = 0; i < 64; i++) vec[i] = -1;
    //cout << "input" << endl;
    print_vector(vec);
    //cout << "calculate result" << endl;
    int i;
    int rep = 10000;
    uint32_t res_size;
    struct timespec tp_start, tp_end;
    clock_gettime(CLOCK_MONOTONIC, &tp_start);
    for (i=0;i<rep;i++){
        res_size = bitarray2set(vec, set.data());
    }
    clock_gettime(CLOCK_MONOTONIC, &tp_end);
    double timing;
    const double nano = 0.000000001;
    timing = ((double)(tp_end.tv_sec - tp_start.tv_sec )
              + (tp_end.tv_nsec - tp_start.tv_nsec) * nano) / rep;
    cout << "timing per cycle: " << timing << endl;
    cout << "print result" << endl;
    //print_vector(set, res_size);
}
result (compiled with icc -O3 code.cpp -lrt)
...
timing per cycle: 0.000739613 (7.4E-4).
print result
0.0008 seconds to convert 768,000 bits to a set. But there are at least 10,000 arrays of 768,000 bits to process in each cycle, so that is 8 seconds per cycle. That is slow.
The cpu has popcnt instruction and sse4.2 instruction set.
Thanks.
Update
template <typename T>
uint32_t bitarray2set(T& v, uint32_t * ptr_set){
    uint32_t base = 0;
    uint32_t * ptr_set_new = ptr_set;
    uint32_t size = v.capacity();
    uint32_t * ptr_v;
    uint32_t * ptr_v_end = v.data() + size; // &v[size] would index one past the end
    for(ptr_v = v.data(); ptr_v < ptr_v_end; ++ptr_v){
        while(*ptr_v) {
            *ptr_set_new++ = base + __builtin_ctz(*ptr_v);
            (*ptr_v) &= (*ptr_v) - 1; // zeros the lowest 1-bit in n
        }
        base += 8*sizeof(uint32_t);
    }
    return (ptr_set_new - ptr_set);
}
This updated version uses the inner loop provided by rhashimoto. I don't know if the inlining actually makes the function slower (I never thought that could happen!). The new timing is 1.14E-5 (compiled with icc -O3 code.cpp -lrt and benchmarked on a random vector).
Warning:
I just found that reserving (instead of resizing) a std::vector and then writing directly to the vector's data through a raw pointer is a bad idea; resizing first and then using the raw pointer is fine. See Robᵩ's answer at "Resizing a C++ std::vector<char> without initializing data". I am going to just use resize instead of reserve and stop worrying about the time that resize wastes by calling the constructor of each element of the vector. At least vectors actually use contiguous memory, like a plain array (see "Are std::vector elements guaranteed to be contiguous?").
I notice that you use .capacity() when you probably mean to use .size(). That can make you do extra unnecessary work, as well as giving you the wrong answer.
Your loop in find_set_bit() iterates over all 32 bits in the word. You can instead iterate only over the set bits and use the BSF instruction to determine the index of the lowest one. GCC has an intrinsic function __builtin_ctz() that generates BSF or the equivalent; I think the Intel compiler also supports it (you can use inline assembly if not). The modified function would look like this:
inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base){
    // Find the set bits in a uint32_t
    while(n) {
        *ptr_set++ = base + __builtin_ctz(n);
        n &= n - 1; // zeros the lowest 1-bit in n
    }
}
On my Linux machine, compiling with g++ -O3, replacing that function drops the reported time from 0.000531434 to 0.000101352.
There are quite a few ways to find a bit index in the answers to this question. I do think that __builtin_ctz() is going to be the best choice for you, though. I don't believe that there is a reasonable SIMD approach to your problem, as each input word produces a variable amount of output.
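If portability matters: __builtin_ctz is GCC/Clang/ICC-specific; on MSVC the usual equivalent is _BitScanForward from <intrin.h>. A small wrapper sketch (the ctz32 name is mine):

#ifdef _MSC_VER
#include <intrin.h>
static inline int ctz32(unsigned long n) {
    unsigned long idx;
    _BitScanForward(&idx, n); // n must be nonzero, which the while(n) loop guarantees
    return (int)idx;
}
#else
#define ctz32 __builtin_ctz
#endif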
As suggested by @davidbak, you could use a table lookup to process 4 elements of the bitmap at once.
Each lookup produces a variable-sized chunk of set members, which we can handle by using popcnt.
@rhashimoto's scalar ctz-based suggestion will probably do better with sparse bitsets that have lots of zeros, but this should be better when there are a lot of set bits.
I'm thinking of something like:
// a vector of 4 elements for every pattern of 4 bits.
// values range from 0 to 3, and will have a multiple of 4 added to them.
alignas(16) static const int LUT[16*4] = { 0,0,0,0, ... };

// mostly C, some pseudocode.
unsigned int bitmap2set(int *set, int input) {
    int *set_start = set;
    __m128i offset = _mm_setzero_si128();
    for (nibble in input[]) { // pseudocode for the actual shifting / masking
        __m128i v = _mm_load_si128((const __m128i*)&LUT[nibble * 4]);
        __m128i vpos = _mm_add_epi32(v, offset);
        _mm_storeu_si128((__m128i*)set, vpos); // set isn't 16-byte aligned in general
        set += _mm_popcnt_u32(nibble);         // variable-length store
        offset = _mm_add_epi32(offset, _mm_set1_epi32(4)); // next nibble starts 4 bits later
    }
    return set - set_start; // set size
}
When a nibble isn't 1111, the next store will overlap, but that's fine.
Using popcnt to figure out how much to increment a pointer is a useful technique in general for left-packing variable-length data into a destination array.
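For what it's worth, here is one way to flesh that pseudocode out into something runnable: a minimal sketch assuming SSE2 plus POPCNT support; the LUT contents, the word2set name, and the driver in main are mine:

#include <cstdint>
#include <cstdio>
#include <immintrin.h> // SSE2 + POPCNT intrinsics (compile with e.g. -msse4.2)

// One 4-int row per nibble pattern: the indices of its set bits (low bit first),
// padded with zeros; the padding entries get overwritten by the next store.
alignas(16) static const int LUT[16 * 4] = {
    0,0,0,0,  0,0,0,0,  1,0,0,0,  0,1,0,0,
    2,0,0,0,  0,2,0,0,  1,2,0,0,  0,1,2,0,
    3,0,0,0,  0,3,0,0,  1,3,0,0,  0,1,3,0,
    2,3,0,0,  0,2,3,0,  1,2,3,0,  0,1,2,3,
};

// Decode one 32-bit word into set indices, numbering bits from `base`.
// The destination needs a few ints of slack past the last valid entry,
// because every store writes a full 16 bytes.
static unsigned word2set(int *set, uint32_t word, int base) {
    int *set_start = set;
    __m128i offset = _mm_set1_epi32(base);
    for (int shift = 0; shift < 32; shift += 4) {
        uint32_t nibble = (word >> shift) & 0xF;
        __m128i v = _mm_load_si128((const __m128i *)&LUT[nibble * 4]);
        _mm_storeu_si128((__m128i *)set, _mm_add_epi32(v, offset));
        set += _mm_popcnt_u32(nibble);                     // keep only the real entries
        offset = _mm_add_epi32(offset, _mm_set1_epi32(4)); // next nibble is 4 bits later
    }
    return (unsigned)(set - set_start);
}

int main(void) {
    int set[36];
    unsigned n = word2set(set, 0x0F5Au, 0); // bits 1,3,4,6,8,9,10,11 are set
    for (unsigned i = 0; i < n; i++)
        printf("%d ", set[i]);              // prints: 1 3 4 6 8 9 10 11
    printf("\n");
}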
I am a beginner in C++, so please help me get this right.
I am trying to read from the collection. In one version of the implementation I tried, there was some beeping from the console; another test displays numbers, so it's probably printing the pointer value of the string...
The code is as follows:
DataCollection.h
#include <objbase.h> // CoTaskMemAlloc and the Windows UINT type (assuming a Windows/COM build; the original snippet omitted its includes)

typedef struct _DataC
{
    char* buffer;
    UINT Id;
} DataC;

void GetDataC( int ArrSize, DataC** DArr );
DataCollection.cpp
#include "DataCollection.h"
void GetDataC( int ArrSize, DataC** DArr )
{
int count = 0;
int strSize = 10;
*DArr = (DataC*)CoTaskMemAlloc( ArrSize * sizeof(DataC));
DataC* CurData = *DArr;
char TestS[] = "SomeText00";
for ( int count = 0; count < ArrSize; count++,CurData++ )
{
TestS[strSize-1] = count + '0';
CurData->Id = count;
CurData->buffer = (char*)malloc(sizeof(char)*strSize);
strcpy(CurData->buffer, TestS);
}
}
test the collection:
int main(void)
{
    DataC* TestDataArr; // maybe use DataC TestDataArr[] instead...
    GetDataC(100000, &TestDataArr);
}
How can I read the collection within a loop?
for...
std::cout << TestDataArr[count].buffer << std::endl;
or:
std::cout << TestDataArr->buffer << std::endl;
What is the correct implementation to read each element in a loop?
Thanks for your time.
As a function parameter, DataC* TestDataArr and DataC TestDataArr[] are the same thing. That said, when you want to reference an element of TestDataArr, you may do one of two things:
TestDataArr[index].buffer
or
(TestDataArr + index)->buffer
Because TestDataArr is a pointer, you must dereference it before you can use any of its members, and that is what index does here. In the first form, the array subscript dereferences the pointer at position index, and you then use . to access members of the object. In the second form, index advances the pointer to the memory location but does not dereference it, so you must use -> to access the members.
So to print the buffer in a loop, you could use either:
std::cout << TestDataArr[count].buffer << std::endl;
or
std::cout << (TestDataArr + count)->buffer << std::endl;
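For completeness, a full read loop plus the matching cleanup (a sketch; each buffer came from malloc and the array itself from CoTaskMemAlloc, so they are freed differently):

const int ArrSize = 100000; // must match the value passed to GetDataC
for (int count = 0; count < ArrSize; count++)
    std::cout << TestDataArr[count].buffer << std::endl;

for (int count = 0; count < ArrSize; count++)
    free(TestDataArr[count].buffer); // each string was malloc'd
CoTaskMemFree(TestDataArr);          // the array came from CoTaskMemAlloc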
The blipping you mention is probably caused by TestS[strSize-1] = count + '0';, where count + '0' produces character values outside the printable range. Some of the resulting control characters (like BEL) cause console beeps.
The problem is in TestS[strSize-1] = count + '0';. When you pass ArrSize == 100000, then at some point in the loop the value count + '0' exceeds the range of char and you get char values in the range [0-31] (non-printable control characters). At least use:
TestS[strSize-1] = '0' + count % (126 - '0');
The last char of TestS will then vary over the range [48-125] (printable ASCII characters).