Custom allocator performance

I'm building an AVL tree class which will have a fixed maximum number of items. So I thought instead of allocating each item by itself, I'd just allocate the entire chunk at once and use a bitmap to assign new memory when needed.
My allocation / deallocation code:
avltree::avltree(UINT64 numitems)
{
root = NULL;
if (!numitems)
buffer = NULL;
else {
UINT64 memsize = sizeof(avlnode) * numitems + bitlist::storagesize(numitems);
buffer = (avlnode *) malloc(memsize);
memmap.init(numitems, buffer + numitems);
memmap.clear_all();
freeaddr = 0;
}
}
avlnode *avltree::newnode(keytype key)
{
if (!buffer)
return new avlnode(key);
else
{
UINT64 pos;
if (freeaddr < memmap.size_bits)
pos = freeaddr++;
else
pos = memmap.get_first_unset();
memmap.set_bit(pos);
return new (&buffer[pos]) avlnode(key);
}
}
void avltree::deletenode(avlnode *node)
{
if (!buffer)
delete node;
else
memmap.clear_bit(node - buffer);
}
In order to use standard new / delete, I have to construct the tree with numitems == 0. In order to use my own allocator, I just pass number of items. All functions are inlined for maximum performance.
This is all fine and dandy, but my own allocator is about 20% slower than new / delete. Now, I know how complex memory allocators are, there's no way that code can run faster than an array lookup + one bit set, but that is exactly the case here. What's worse: my deallocator is slower even if I remove all code from it?!?
When I check assembly output, my allocator's code path is riddled with QWORD PTR instructions dealing with bitmap, avltree or avlnode. It doesn't seem to be much different for the new / delete path.
For example, assembly output of avltree::newnode:
;avltree::newnode, COMDAT
mov QWORD PTR [rsp+8], rbx
push rdi
sub rsp, 32
;if (!buffer)
cmp QWORD PTR [rcx+8], 0
mov edi, edx
mov rbx, rcx
jne SHORT $LN4@newnode
; return new avlnode(key);
mov ecx, 24
call ??2@YAPEAX_K@Z ; operator new
jmp SHORT $LN27@newnode
;$LN4@newnode:
;else {
; UINT64 pos;
; if (freeaddr < memmap.size_bits)
mov r9, QWORD PTR [rcx+40]
cmp r9, QWORD PTR [rcx+32]
jae SHORT $LN2@newnode
; pos = freeaddr++;
lea rax, QWORD PTR [r9+1]
mov QWORD PTR [rcx+40], rax
; else
jmp SHORT $LN1@newnode
$LN2@newnode:
; pos = memmap.get_first_unset();
add rcx, 16
call ?get_first_unset@bitlist@@QEAA_KXZ ; bitlist::get_first_unset
mov r9, rax
$LN1@newnode:
; memmap.set_bit(pos);
mov rcx, QWORD PTR [rbx+16] ;data[bindex(pos)] |= bmask(pos);
mov rdx, r9 ;return pos / (sizeof(BITINT) * 8);
shr rdx, 6
lea r8, QWORD PTR [rcx+rdx*8] ;data[bindex(pos)] |= bmask(pos);
movzx ecx, r9b ;return 1ull << (pos % (sizeof(BITINT) * 8));
mov edx, 1
and cl, 63
shl rdx, cl
; return new (&buffer[pos]) avlnode(key);
lea rcx, QWORD PTR [r9+r9*2]
; File c:\projects\vvd\vvd\util\bitlist.h
or QWORD PTR [r8], rdx ;data[bindex(pos)] |= bmask(pos)
; 195 : return new (&buffer[pos]) avlnode(key);
mov rax, QWORD PTR [rbx+8]
lea rax, QWORD PTR [rax+rcx*8]
; $LN27@newnode:
test rax, rax
je SHORT $LN9@newnode
; avlnode constructor;
mov BYTE PTR [rax+4], 1
mov QWORD PTR [rax+8], 0
mov QWORD PTR [rax+16], 0
mov DWORD PTR [rax], edi
; 196 : }
; 197 : }
; $LN9@newnode:
mov rbx, QWORD PTR [rsp+48]
add rsp, 32 ; 00000020H
pop rdi
ret 0
?newnode@avltree@@QEAAPEAUavlnode@@H@Z ENDP ; avltree::newnode
_TEXT ENDS
I have checked multiple times the output of compilation when I construct my avltree with default / custom allocator and it remains the same in this particular region of code. I have tried removing / replacing all relevant parts to no significant effect.
To be honest, I expected compiler to inline all of this since there are very few variables. I was hoping for everything except the avlnode objects themselves to be placed in registers, but that doesn't seem to be the case.
Yet the speed difference is clearly measurable. I'm not calling 3 seconds per 10 million nodes inserted slow, but I expected my code to be faster, not slower, than the generic allocator (2.5 seconds). That goes especially for the deallocator, which is slower even when all code is stripped from it.
Why is it slower?
Edit:
Thank you all for excellent thoughts on this. But I would like to stress again that the issue is not so much within my method of allocation as it is in suboptimal way of using the variables: the entire avltree class contains just 4 UINT64 variables, bitlist only has 3.
However, despite that, the compiler doesn't optimise this into registers. It insists on QWORD PTR instructions that are orders of magnitude slower. Is this because I'm using classes? Should I move to C / plain variables? Scratch that. Stupid me. I have all the avltree code in there as well, things can't be in registers.
Also, I am at a total loss why my deallocator would still be slower, even if I remove ALL code from it. Yet QueryPerformanceCounter tells me just that. It's insane to even think that: that same deallocator also gets called for the new / delete code path and it has to delete the node... It doesn't have to do anything for my custom allocator (when I strip the code).
Edit2:
I have now completely removed the bitlist and implemented free space tracking through a singly-linked list. The avltree::newnode function is now much more compact (21 instructions for my custom allocator path, 7 of those are QWORD PTR ops dealing with avltree and 4 are used for constructor of avlnode).
The end result (time) decreased from ~3 seconds to ~2.95 seconds for 10 million allocations.
Edit3:
I also rewrote the entire code such that now everything is handled by the singly linked list. Now the avltree class only has two relevant members: root and first_free. The speed difference remains.
Edit4:
Rearranging code and looking at performance figures, these things are what helped the most:
As pointed out by all contributors, having a bitmap in there was just plain bad. Removed in favour of singly-linked free slot list.
Code locality: moving the dependent functions (the AVL tree handling ones) into a function-local class instead of declaring them globally improved speed by some 15% (3 secs --> 2.5 secs)
avlnode struct size: just adding #pragma pack(1) before the struct declaration decreased execution time a further 20% (2.5 secs --> 2 secs)
Edit 5:
Since this question seems to have been quite popular, I have posted the final complete code as an answer below. I'm quite satisfied with its performance.

Your method only allocates the raw memory in one chunk and then has to do a placement new for each element. Combine that with all the overhead in your bitmap and it's not too surprising that the default new allocation beats yours, assuming an empty heap.
To get the most speed improvement when allocating you can allocate the entire object in one large array and then assign to it from there. If you look at a very simple and contrived benchmark:
struct test_t {
float f;
int i;
test_t* pNext;
};
const size_t NUM_ALLOCS = 50000000;
void TestNew (void)
{
test_t* pPtr = new test_t;
for (int i = 0; i < NUM_ALLOCS; ++i)
{
pPtr->pNext = new test_t;
pPtr = pPtr->pNext;
}
}
void TestBucket (void)
{
test_t* pBuckets = new test_t[NUM_ALLOCS + 2];
test_t* pPtr = pBuckets++;
for (int i = 0; i < NUM_ALLOCS; ++i)
{
pPtr->pNext = pBuckets++;
pPtr = pPtr->pNext;
}
}
With this code on MSVC++ 2013 with 50M allocations TestBucket() outperforms TestNew() by over a factor of x16 (130 vs 2080 ms). Even if you add a std::bitset<> to track allocations it is still x4 faster (400 ms).
An important thing to remember about new is that the time it takes to allocate an object generally depends on the state of the heap. An empty heap will be able to allocate a bunch of constant sized objects like this relatively fast, which is probably one reason why your code seems slower than new. If you have a program that runs for a while and allocates a large number of differently sized objects, the heap can become fragmented and allocating objects can take much (much) longer.
As an example, one program I wrote was loading a 200MB file with millions of records...lots of differently sized allocations. On the first load it took ~15 seconds but if I deleted that file and tried to load it again it took something like x10-x20 longer. This was entirely due to memory allocation and switching to a simple bucket/arena allocator fixed the issue. So, that contrived benchmark I did showing a x16 speedup might actually show a significantly larger difference with a fragmented heap.
It gets even more tricky when you realize that different systems/platforms may use different memory allocation schemes so the benchmark results on one system may be different from another.
To distil this into a few short points:
Benchmarking memory allocation is tricky (performance depends on a lot of things)
In some cases you can get better performance with a custom allocator. In a few cases you can get much better.
Creating a custom allocator can be tricky and takes time to profile/benchmark your specific use case.
Note -- Benchmarks like this aren't meant to be realistic but are useful to determine the upper bound of how fast something can be. It can be used along with the profile/benchmark of your actual code to help determine what should/shouldn't be optimized.
Update -- I can't seem to replicate your results in my code under MSVC++ 2013. Using the same structure as your avlnode and trying a placement new yields the same speed as my non-placement bucket allocator tests (placement new was actually a little bit faster). Using a class similar to your avltree doesn't affect the benchmark much. With 10 million allocations/deallocations I'm getting ~800 ms for the new/delete and ~200ms for the custom allocator (both with and without placement new). While I'm not worried about the difference in absolute times, the relative time difference seems odd.
I would suggest taking a closer look at your benchmark and making sure you are measuring what you think you are. If the code exists in a larger code-base then create a minimal test case to benchmark it. Make sure that your compiler optimizer is not doing something that would invalidate the benchmark (it happens too easily these days).
Note that it would be far easier to answer your question if you had reduced it to a minimal example and included the complete code in the question, including the benchmark code. Benchmarking is one of those things that seems easy but there are a lot of "gotchas" involved in it.
Update 2 -- Including the basic allocator class and benchmark code I'm using so others can try to duplicate my results. Note that this is for testing only and is far from actual working/production code. It is far simpler than your code which may be why we're getting different results.
#include <cstdio> // printf
#include <new> // placement new
#include <Windows.h>
struct test_t
{
__int64 key;
__int64 weight;
__int64 left;
__int64 right;
test_t* pNext; // Simple linked list
test_t() : key(0), weight(0), pNext(NULL), left(0), right(0) { }
test_t(const __int64 k) : key(k), weight(0), pNext(NULL), left(0), right(0) { }
};
const size_t NUM_ALLOCS = 10000000;
test_t* pLast; //To prevent compiler optimizations from being "smart"
struct CTest
{
test_t* m_pBuffer;
size_t m_MaxSize;
size_t m_FreeIndex;
test_t* m_pFreeList;
CTest(const size_t Size) :
m_pBuffer(NULL),
m_MaxSize(Size),
m_pFreeList(NULL),
m_FreeIndex(0)
{
if (m_MaxSize > 0) m_pBuffer = (test_t *) new char[sizeof(test_t) * (m_MaxSize + 1)];
}
test_t* NewNode(__int64 key)
{
if (!m_pBuffer || m_FreeIndex >= m_MaxSize) return new test_t(key);
size_t Pos = m_FreeIndex;
++m_FreeIndex;
return new (&m_pBuffer[Pos]) test_t(key);
}
void DeleteNode(test_t* pNode)
{
if (!m_pBuffer) {
delete pNode;
}
else
{
pNode->pNext = m_pFreeList;
m_pFreeList = pNode;
}
}
};
void TestNew(void)
{
test_t* pPtr = new test_t;
test_t* pFirst = pPtr;
for (int i = 0; i < NUM_ALLOCS; ++i)
{
pPtr->pNext = new test_t;
pPtr = pPtr->pNext;
}
pPtr = pFirst;
while (pPtr)
{
test_t* pTemp = pPtr;
pPtr = pPtr->pNext;
delete pTemp;
}
pLast = pPtr;
}
void TestClass(const size_t BufferSize)
{
CTest Alloc(BufferSize);
test_t* pPtr = Alloc.NewNode(0);
test_t* pFirstPtr = pPtr;
for (int i = 0; i < NUM_ALLOCS; ++i)
{
pPtr->pNext = Alloc.NewNode(i);
pPtr = pPtr->pNext;
}
pLast = pPtr;
pPtr = pFirstPtr;
while (pPtr != NULL)
{
test_t* pTmp = pPtr->pNext;
Alloc.DeleteNode(pPtr);
pPtr = pTmp;
}
}
int main(void)
{
DWORD StartTick = GetTickCount();
TestClass(0);
//TestClass(NUM_ALLOCS + 10);
//TestNew();
DWORD EndTick = GetTickCount();
printf("Time = %u ms\n", EndTick - StartTick);
printf("Last = %p\n", pLast);
return 0;
}
Currently I'm getting ~800ms for both TestNew() and TestClass(0) and under 200ms for TestClass(NUM_ALLOCS + 10). The custom allocator is pretty fast as it operates on the memory in a completely linear fashion which allows the memory cache to work its magic. I'm also using GetTickCount() for simplicity and it is accurate enough so long as times are above ~100ms.

It's hard to be certain with such little code to study, but I'm betting on locality of reference. Your bitmap with metadata is not on the same cacheline as the allocated memory itself. And get_first_unset might be a linear search.

Now, I know how complex memory allocators are, there's no way that code can run faster than an array lookup + one bit set, but that is exactly the case here.
This isn't even nearly correct. A decent bucketing low fragmentation heap is O(1) with very low constant time (and effectively zero additional space overhead). I've seen a version that came down to ~18 asm instructions (with one branch) before. That's a lot less than your code. Remember, heaps may be massively complex in total, but the fast path through them may be really, really fast.

Just for reference, the following code was the most performant for the problem at hand.
It's just a simple avltree implementation, but it does reach 1,7 secs for 10 million inserts and 1,4 secs for an equal number of deletes on my 2600K @ 4.6 GHz.
#include "stdafx.h"
#include <iostream>
#include <crtdbg.h>
#include <Windows.h>
#include <malloc.h>
#include <new>
#ifndef NULL
#define NULL 0
#endif
typedef int keytype;
typedef unsigned long long UINT64;
struct avlnode;
struct avltree
{
avlnode *root;
avlnode *buffer;
avlnode *firstfree;
avltree() : avltree(0) {};
avltree(UINT64 numitems);
inline avlnode *newnode(keytype key);
inline void deletenode(avlnode *node);
void insert(keytype key) { root = insert(root, key); }
void remove(keytype key) { root = remove(root, key); }
int height();
bool hasitems() { return root != NULL; }
private:
avlnode *insert(avlnode *node, keytype k);
avlnode *remove(avlnode *node, keytype k);
};
#pragma pack(1)
struct avlnode
{
avlnode *left; //left pointer
avlnode *right; //right pointer
keytype key; //node key
unsigned char hgt; //height of the node
avlnode(int k)
{
key = k;
left = right = NULL;
hgt = 1;
}
avlnode &balance()
{
struct F
{
unsigned char height(avlnode &node)
{
return &node ? node.hgt : 0;
}
int balance(avlnode &node)
{
return &node ? height(*node.right) - height(*node.left) : 0;
}
int fixheight(avlnode &node)
{
unsigned char hl = height(*node.left);
unsigned char hr = height(*node.right);
node.hgt = (hl > hr ? hl : hr) + 1;
return (&node) ? hr - hl : 0;
}
avlnode &rotateleft(avlnode &node)
{
avlnode &p = *node.right;
node.right = p.left;
p.left = &node;
fixheight(node);
fixheight(p);
return p;
}
avlnode &rotateright(avlnode &node)
{
avlnode &q = *node.left;
node.left = q.right;
q.right = &node;
fixheight(node);
fixheight(q);
return q;
}
avlnode &b(avlnode &node)
{
int bal = fixheight(node);
if (bal == 2) {
if (balance(*node.right) < 0)
node.right = &rotateright(*node.right);
return rotateleft(node);
}
if (bal == -2) {
if (balance(*node.left) > 0)
node.left = &rotateleft(*node.left);
return rotateright(node);
}
return node; // balancing is not required
}
} f;
return f.b(*this);
}
};
avltree::avltree(UINT64 numitems)
{
root = buffer = firstfree = NULL;
if (numitems) {
buffer = (avlnode *) malloc(sizeof(avlnode) * (numitems + 1));
avlnode *tmp = &buffer[numitems];
while (tmp > buffer) {
tmp->right = firstfree;
firstfree = tmp--;
}
}
}
avlnode *avltree::newnode(keytype key)
{
avlnode *node = firstfree;
/*
If you want to support dynamic allocation, uncomment this.
It does present a bit of an overhead for bucket allocation though (8% slower)
Also, if a condition is met where bucket is too small, new nodes will be dynamically allocated, but never freed
if (!node)
return new avlnode(key);
*/
firstfree = firstfree->right;
return new (node) avlnode(key);
}
void avltree::deletenode(avlnode *node)
{
/*
If you want to support dynamic allocation, uncomment this.
if (!buffer)
delete node;
else {
*/
node->right = firstfree;
firstfree = node;
}
int avltree::height()
{
return root ? root->hgt : 0;
}
avlnode *avltree::insert(avlnode *node, keytype k)
{
if (!node)
return newnode(k);
if (k == node->key)
return node;
else if (k < node->key)
node->left = insert(node->left, k);
else
node->right = insert(node->right, k);
return &node->balance();
}
avlnode *avltree::remove(avlnode *node, keytype k) // deleting k key from p tree
{
if (!node)
return NULL;
if (k < node->key)
node->left = remove(node->left, k);
else if (k > node->key)
node->right = remove(node->right, k);
else // k == p->key
{
avlnode *l = node->left;
avlnode *r = node->right;
deletenode(node);
if (!r) return l;
struct F
{
//findmin finds the minimum node
avlnode &findmin(avlnode *node)
{
return node->left ? findmin(node->left) : *node;
}
//removemin removes the minimum node
avlnode &removemin(avlnode &node)
{
if (!node.left)
return *node.right;
node.left = &removemin(*node.left);
return node.balance();
}
} f;
avlnode &min = f.findmin(r);
min.right = &f.removemin(*r);
min.left = l;
return &min.balance();
}
return &node->balance();
}
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
// 64 bit release performance (for 10.000.000 nodes)
// malloc: insertion: 2,595 deletion 1,865
// my allocator: insertion: 2,980 deletion 2,270
const int nodescount = 10000000;
avltree tree(nodescount); // standard C++ does not allow binding a non-const reference to a temporary
cout << "sizeof avlnode " << sizeof(avlnode) << endl;
cout << "inserting " << nodescount << " nodes" << endl;
LARGE_INTEGER t1, t2, freq;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t1);
for (int i = 1; i <= nodescount; i++)
tree.insert(i);
QueryPerformanceCounter(&t2);
cout << "Tree height " << (int) tree.height() << endl;
cout << "Insertion time: " << ((double) t2.QuadPart - t1.QuadPart) / freq.QuadPart << " s" << endl;
QueryPerformanceCounter(&t1);
while (tree.hasitems())
tree.remove(tree.root->key);
QueryPerformanceCounter(&t2);
cout << "Deletion time: " << ((double) t2.QuadPart - t1.QuadPart) / freq.QuadPart << " s" << endl;
#ifdef _DEBUG
_CrtMemState mem;
_CrtMemCheckpoint(&mem);
cout << "Memory used: " << mem.lTotalCount << " high: " << mem.lHighWaterCount << endl;
#endif
return 0;
}

Related

Illegal instruction occurs while using pointer and reference

While reading the source code of realtime_tools::RealtimeBuffer, I ran into several questions about pointers and references. The relevant code is shown below:
void writeFromNonRT(const T& data)
{
// get lock
lock();
// copy data into non-realtime buffer
*non_realtime_data_ = data;
new_data_available_ = true;
// release lock
mutex_.unlock();
}
To figure it out, I tried writing similar code:
#include <iostream>
using namespace std;
void pt_ref(int& data)
{
int *ptr;
ptr = &data; // ptr points to "data"
cout << "data's address: "<< ptr <<"\n"; // print address
}
int main()
{
int x = 3;
pt_ref(x);
cout << "x's address: " << &x;
}
// output:
// data's address: 0x7ffe05c17c4c
// x's address: 0x7ffe05c17c4c
This code runs fine, but it still differs from the source code:
// these 2 lines are different.
*non_realtime_data_ = data;
ptr = &data;
So I tried to change ptr = &data; to *ptr = data; and ran the code again, and the error "illegal instruction" occurred.
I hope someone can explain this. Thanks a lot.
PS: I ran the code on the Replit online compiler.

I tried to change ptr = &data; to *ptr = data; and ran the code again, and the error "illegal instruction" occurred.
The problem is that the pointer ptr was uninitialized (it does not point to any int object), so dereferencing it (which you did when you wrote *ptr on the left-hand side) leads to undefined behavior.
int *ptr; //pointer ptr does not point to any int object as of now
*ptr = data;
//-^^^^--------->undefined behavior since ptr doesn't point to any int object
To solve this make sure that before dereferencing ptr, the pointer ptr points to some int object.
void pt_ref(int& data)
{
int var = 10; //int object
//-------------vvvv-------->now ptr points to "var"
int *ptr = &var;
//--vvvv------------------->this is fine now
*ptr = data;
}

Implementation of stack in C++ without using <stack>

I want to implement a stack. I found a working model on the internet, but unfortunately it assumes that I know the size of the stack up front. What I want is to be able to add segments to my stack as they are needed, because the potential maximum number of slots runs into the tens of thousands, and from my understanding fixing the size in advance (when most of it is not needed most of the time) is a huge waste of memory and hurts the execution speed of the program. I also do not want to use any complex prewritten functions in my implementation (those provided by the STL or other libraries, such as vector), as I want to understand them better by trying to write them myself or with brief help.
struct variabl {
char *given_name;
double value;
};
variabl* variables[50000];
int c = 0;
int end_of_stack = 0;
class Stack
{
private:
int top, length;
char *z;
int index_struc = 0;
public:
Stack(int = 0);
~Stack();
char pop();
void push();
};
Stack::Stack(int size) /*
This is where the problem begins, I want to be able to allocate the size
dynamically.
*/
{
top = -1;
length = size;
z = new char[length];
}
void Stack::push()
{
++top;
z[top] = variables[index_struc]->value;
index_struc++;
}
char Stack::pop()
{
end_of_stack = 0;
if (z == 0 || top == -1)
{
end_of_stack = 1;
return NULL;
}
char top_stack = z[top];
top--;
length--;
return top_stack;
}
Stack::~Stack()
{
delete[] z;
}
I had something of an idea, and tried doing
Stack stackk
//whenever I want to put another thing into stack
stackk.push = new char;
but I didn't completely understand how it would work for my purpose. I don't think it will be fully accessible with the pop method etc., because it will be a set of separate arrays/variables, right? I want the implementation to remain reasonably simple so I can understand it.

Change your push function to take a parameter, rather than needing to reference variables.
To handle pushes, start with an initial length of your array z (and change z to a better variable name). When you are pushing a new value, check if the new value will mean that the size of your array is too small (by comparing length and top). If it will exceed the current size, allocate a bigger array and copy the values from z to the new array, free up z, and make z point to the new array.

Here is a simple implementation that does not need to reallocate arrays. It uses the auxiliary class Node, which holds a value and a pointer to another Node (set to NULL to indicate the end of the stack).
main() tests the stack by reading commands of the form
p c: push c to the stack
g: print top of stack and pop
#include <cstdlib>
#include <iostream>
using namespace std;
class Node {
private:
char c;
Node *next;
public:
Node(char cc, Node *nnext){
c = cc;
next = nnext;
}
char getChar(){
return c;
}
Node *getNext(){
return next;
}
~Node(){}
};
class Stack {
private:
Node *start;
public:
Stack(){
start = NULL;
}
void push(char c){
start = new Node(c, start);
}
char pop(){
if(start == NULL){
//Handle error
cerr << "pop on empty stack" << endl;
exit(1);
}
else {
char r = (*start).getChar();
Node* newstart = (*start).getNext();
delete start;
start = newstart;
return r;
}
}
bool empty(){
return start == NULL;
}
};
int main(){
char c, k;
Stack st;
while(cin>>c){
switch(c){
case 'p':
cin >> k;
st.push(k);
break;
case 'g':
cout << st.pop()<<endl;
break;
}
}
return 0;
}

How to save register values using inline assembler

(This is the first time I've asked a question here and English isn't my first language, so please forgive my mistakes. I'm also new to programming.)
I ran into this problem while doing my OS homework: we were asked to simulate the function SwitchToFiber, and my current problem is that I don't know how to save the register values so that a function can be resumed the next time it is called.
I'm not sure my problem is clear. I don't think my code is very useful, but I'll put it below anyway.
#include <stdio.h>
#define INVALID_THD NULL
#define N 5
#define REG_NUM 32
unsigned store[N][REG_NUM];
typedef struct
{
void (*address) (void * arg);
void* argu;
}thread_s;
thread_s ts[N];
void StartThds();
void YieldThd();
void *CreateThd(void (*ThdFunc)(void*), void * arg);
void thd1(void * arg);
void thd2(void * arg);
void StartThds()
{
}
void YieldThd()
{
thd2((void*)2);
}
void *CreateThd(void (*ThdFunc)(void*), void * arg)
{
ts[(int)arg].address = (*ThdFunc);
ts[(int)arg].argu = arg;
}
void thd2(void * arg)
{
for (int i = 4; i < 12; i++)
{
printf("\tthd2: arg=%d , i = %d\n", (int)arg, i);
//to make the output easier to read, I added \t above
YieldThd();
}
}
void thd1(void * arg)
{
/*
__asm__(
);
*/
for (int i = 0; i < 12; i++)
{
printf("thd1: arg=%d , i = %d\n", (int)arg, i);
YieldThd();
}
}
int main()
{
//this is my first plan, to store the register values in a static array
for(int i = 0; i<N; i++)
for(int j = 0; j<REG_NUM; j++)
store[i][j] = 0;
//create the two thread
if (CreateThd(thd1, (void *)1) == INVALID_THD)
{
printf("cannot create\n");
}
if (CreateThd(thd2, (void *)2) == INVALID_THD)
{
printf("cannot create\n");
}
ts[1].address(ts[1].argu); //thd1((void*)1),argu = 1；
// StartThds();
return 0;
}
This is all the code I have so far. I don't know which parts may be useful, so I've included all of it above. As you can see, most of it is still empty.

It's possible (as pointed out in comments) that you don't need to write assembly for this, perhaps you can get away with just using setjmp()/longjmp(), and have them do the necessary state-saving.

I have done this before, but I always have to look up the details. The following is, of course, just pseudo-code.
Basically, you create a struct to hold your registers:
typedef struct regs {
int ebx; // make sure these have the right size for the processor
int ecx;
// ... for all registers you wish to back up
} registers;
// when switching away from one thread, save its registers
asm( // assembly syntax varies from compiler to compiler; check your manual
"mov ebx, thread1.register.ebx;"
"mov ecx, thread1.register.ecx;"
// and so on
// very important: store the current program counter (the return address of this function) so we can continue from there
// you must know where the return address is stored
"mov return address, thread1.register.ret;"
);
// restore the other thread's registers
asm(
"mov thread2.register.ebx, ebx;"
"mov thread2.register.ecx, ecx;"
// now restore the pc and let it run
"mov thread2.register.ret, pc;" // this will continue from where we stopped before
);
This is more or less the principle of how it works. Since you are learning this, you should be able to figure out the rest on your own.

How to access the addresses after i get out of the loop?

#include<iostream>
using namespace std;
struct data {
int x;
data *ptr;
};
int main() {
int i = 0;
while( i >=3 ) {
data *pointer = new data; // pointer points to the address of data
pointer->ptr = pointer; // ptr contains the address of pointer
i++;
}
system("pause");
}
Let us assume after iterating 3 times :
ptr had address = 100 after first loop
ptr had address = 200 after second loop
ptr had address = 300 after third loop
Now the questions are :
Do all the three addresses that were being assigned to ptr exist in the memory after the program gets out of the loop ?
If yes , what is the method to access these addresses after i get out of the loop ?

Well, the memory is reserved but you have no pointer to it, which is what's called a memory leak (reserved memory with no way to get to it). You may want to keep an array of data* to save these pointers so you can use them later and delete them when you are done with them.

For starters, there will be no memory allocated for any ptr with the code you have.
int i = 0;
while( i >= 3)
This will not enter the while loop at all.
However, if you are looking to access the ptr contained inside the struct then you can try this. I am not sure what you are trying to achieve by assigning the ptr with its own struct object address. The program below will print the value of x and the address assigned to ptr.
#include<iostream>
using namespace std;
struct data {
int x;
data *ptr;
};
int main() {
int i = 0;
data pointer[4];
while( i <=3 ) {
pointer[i].x = i;
pointer[i].ptr = &pointer[i];
i++;
}
for( int i = 0; i <= 3; i++ )
{
cout<< pointer[i].x << endl;
cout<< pointer[i].ptr << endl;
}
}
OUTPUT:
0
0xbf834e98
1
0xbf834ea0
2
0xbf834ea8
3
0xbf834eb0
Personally, when I know the number of iterations I want to do, I choose a for loop, and I use while only when I need to iterate an unknown number of times until a logical expression is satisfied.

I cannot guess exactly what you are trying to achieve, but I think it is something like this.
If you want to build a linked list using your struct, you can try this:
#include <cstdlib>
#include <iostream>
using namespace std;
struct data {
    int x;
    data *ptr;
    data()
    {
        x = -1;
        ptr = NULL;
    }
};
data *head = new data();   // dummy head node
data *pointer = head;      // always points at the last node in the list
int main() {
    int i = 0;
    while( i <= 3 ) {
        data *node = new data();
        node->x = i;          // placeholder: store your own data here
        pointer->ptr = node;  // link the new node after the current tail
        pointer = node;       // the new node becomes the tail
        i++;
    }
    data *cur = head->ptr;    // skip the dummy head
    while( cur != NULL ) {
        cout << cur->x << endl;
        cur = cur->ptr;
    }
    system("pause");
}
This will print the elements in the linked list.

Memory leak in trivial stack implementation

I'm decently experienced with Python and Java, but I recently decided to learn C++. I decided to make a quick integer stack implementation, but it has a massive memory leak that I can't understand. When I pop a node, it doesn't seem to release the memory even though I explicitly delete the old node upon popping it. When I run it, it uses 150 MB of memory, but doesn't release any of it after I empty the stack. I would appreciate any help since this is my first foray into a language without garbage collection. This was compiled with gcc 4.3 on 64-bit Kubuntu.
//a trivial linked list based stack of integers
#include <iostream>
#include <unistd.h> // sleep()
using namespace std;
class Node
{
private:
int num;
Node * next;
public:
Node(int data, Node * next);
int getData();
Node * getNext();
};
Node::Node(int data, Node * next_node)
{
num = data;
next = next_node;
}
inline int Node::getData()
{
return num;
}
inline Node* Node::getNext()
{
return next;
}
class Stack
{
private:
unsigned long int n;
Node * top;
public:
Stack(int first);
Stack();
void push(int data);
int pop();
int peek();
unsigned long int getSize();
void print();
void empty();
};
Stack::Stack(int first)
{
Node first_top (first, NULL);
top = &first_top;
n = 1;
}
Stack::Stack()
{
top = NULL;
n = 0;
}
void Stack::push(int data)
{
Node* old_top = top;
Node* new_top = new Node(data,old_top);
top = new_top;
n++;
}
int Stack::pop()
{
Node* old_top = top;
int ret_num = old_top->getData();
top = old_top->getNext();
delete old_top;
n--;
return ret_num;
}
inline int Stack::peek()
{
return top->getData();
}
inline unsigned long int Stack::getSize()
{
return n;
}
void Stack::print()
{
Node* current = top;
cout << "Stack: [";
for(unsigned long int i = 0; i<n-1; i++)
{
cout << current->getData() << ", ";
current = current->getNext();
}
cout << current->getData() << "]" << endl;
}
void Stack::empty()
{
unsigned long int upper = n;
for(unsigned long int i = 0; i<upper; i++)
{
this->pop();
}
}
Stack createStackRange(int start, int end, int step = 1)
{
Stack stack = Stack();
for(int i = start; i <= end; i+=step)
{
stack.push(i);
}
return stack;
}
int main()
{
Stack s = createStackRange(0,5e6);
cout << s.peek() << endl;
sleep(1);
cout << "emptying" <<endl;
s.empty();
cout << "emptied" <<endl;
cout << "The size of the stack is " << s.getSize()<<endl;
cout << "waiting..." << endl;
sleep(10);
return 0;
}

How do you KNOW the memory isn't being released? The runtime library will manage allocations and may not release the memory back to the OS until the program terminates. If that's the case, the memory will be available for other allocations within your program during its execution.
However.... you seem to have other problems. My C++ is really rusty since I've been doing Java for 15 years, but in your Stack::Stack constructor you're allocating a Node instance on the system stack and then storing a reference to it in your "Stack". That Node instance goes out of scope when the constructor ends, leaving a dangling pointer.

Stack::Stack(int first)
{
Node first_top (first, NULL);
top = &first_top;
n = 1;
}
This is wrong: you can't assign the address of a local object to a class member (top), since local objects are destroyed when the function returns.
Create the node on the heap rather than on the stack, like this:
Stack::Stack(int first)
{
top = new Node(first, NULL);
n = 1;
}
Also make sure the concept of a linked list is clear to you; working through it with pen and paper helps.
Your Stack::push(int) operation seems buggy; check what you may have forgotten to do.
My suggestion is to implement a generic stack using a template, so that it works for any data type.
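A minimal sketch of what such a template stack could look like, assuming a C++11 compiler. The name LinkedStack and its members are made up for illustration and are not from the question. Copying is disabled here to keep the ownership of the nodes unambiguous.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical generic linked-list stack; all names are illustrative.
template <typename T>
class LinkedStack {
    struct Node {
        T value;
        Node* next;
    };
    Node* top_ = nullptr;
    std::size_t size_ = 0;

public:
    LinkedStack() = default;
    // Copying is disabled so two stacks can never share (and double-delete) nodes.
    LinkedStack(const LinkedStack&) = delete;
    LinkedStack& operator=(const LinkedStack&) = delete;

    // The destructor releases every node still on the stack.
    ~LinkedStack() {
        while (top_) {
            Node* next = top_->next;
            delete top_;
            top_ = next;
        }
    }

    void push(const T& value) {
        top_ = new Node{value, top_};  // the new node points at the old top
        ++size_;
    }

    T pop() {
        if (!top_) throw std::out_of_range("pop on empty stack");
        Node* old = top_;
        T value = old->value;
        top_ = old->next;
        delete old;
        --size_;
        return value;
    }

    const T& peek() const {
        if (!top_) throw std::out_of_range("peek on empty stack");
        return top_->value;
    }

    std::size_t size() const { return size_; }
    bool empty() const { return top_ == nullptr; }
};
```

With this, LinkedStack<int> behaves like the stack in the question, and LinkedStack<std::string> or any other copyable type works the same way.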

When createStackRange() returns it'll return a copy of the Stack using the compiler-generated copy constructor which just makes a bitwise copy (i.e., it'll copy the pointer to the first node and the size.)
More seriously, you're missing the destructor for the Stack class. Ideally you'd have it walk the list and call delete on each Node. The Stack object created on the processor stack will be cleaned up automatically when main() exits, but without a destructor, the nodes will still be allocated when the program ends. You probably want something like this for it:
Stack::~Stack()
{
while ( top )
{
Node *next = top->getNext();
delete top;
top = next;
}
}
The way to think of it is that the C++ compiler will automatically generate copy constructors and destructors for you, but they're normally shallow. If you need deep behavior, you've got to implement it yourself somewhere.
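As a rough sketch of what "deep" behavior could look like for an integer stack, here is a version with a user-written copy constructor, assignment operator and destructor. The class name IntStack and its layout are assumptions for illustration, not the code from the question, and it assumes a C++11 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical int stack that owns its nodes and copies them deeply.
class IntStack {
    struct Node {
        int value;
        Node* next;
    };
    Node* top_ = nullptr;
    std::size_t size_ = 0;

public:
    IntStack() = default;

    // Deep copy: walk the source list and append a fresh node per element,
    // so the copy has the same order but shares no nodes with the source.
    // (Sketch only: if new throws midway, the nodes built so far are leaked.)
    IntStack(const IntStack& other) {
        Node** tail = &top_;
        for (const Node* n = other.top_; n; n = n->next) {
            *tail = new Node{n->value, nullptr};
            tail = &(*tail)->next;
            ++size_;
        }
    }

    // Copy-and-swap: the by-value parameter is already a deep copy,
    // and its destructor frees our old nodes after the swap.
    IntStack& operator=(IntStack other) {
        std::swap(top_, other.top_);
        std::swap(size_, other.size_);
        return *this;
    }

    ~IntStack() {
        while (top_) {
            Node* next = top_->next;
            delete top_;
            top_ = next;
        }
    }

    void push(int value) {
        top_ = new Node{value, top_};
        ++size_;
    }

    int pop() {
        assert(top_ && "pop on empty stack");
        Node* old = top_;
        int value = old->value;
        top_ = old->next;
        delete old;
        --size_;
        return value;
    }

    int peek() const {
        assert(top_ && "peek on empty stack");
        return top_->value;
    }

    std::size_t size() const { return size_; }
};
```

With these three members in place, returning a stack by value from a function like createStackRange() is safe even when the compiler does not elide the copy, because each copy owns its own nodes.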

After poring over the code, I couldn't find the leak so I compiled it and ran it in a debugger myself. I agree with Jim Garrision - I think you're seeing an artifact of the runtime rather than an actual leak, because I'm not seeing it on my side. The issues pointed out by NickLarsen and smith are both actual issues that you want to correct, but if you trace the code through, neither should actually be causing the problem you describe. The code smith singles out is never called in your example, and the code Nick singles out would cause other issues, but not the one you're seeing.

Create a stub to test your code and use a memory analysis tool like Valgrind. It will find memory leaks and memory corruption for you.
Check the man pages for more information.

Note that you should only roll your own stack for educational purposes. For any real code, you should use the stack implementation that comes with the C++ standard library...
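For comparison, a small sketch of the same range-building idea on top of std::stack. The helper names makeStackRange and drainAndSum are made up for this example; only std::stack and its push/top/pop/empty members come from the standard library.

```cpp
#include <cassert>
#include <stack>

// Hypothetical counterpart of createStackRange() from the question.
// Assumes step > 0; otherwise the loop would not terminate.
std::stack<int> makeStackRange(int start, int end, int step = 1) {
    assert(step > 0);
    std::stack<int> s;
    for (int i = start; i <= end; i += step) {
        s.push(i);
    }
    return s;
}

// Empties the stack and returns the sum of its elements.
// std::stack::pop() returns void, so the value is read with top() first.
long long drainAndSum(std::stack<int>& s) {
    long long sum = 0;
    while (!s.empty()) {
        sum += s.top();
        s.pop();
    }
    return sum;
}
```

There is no destructor or copy constructor to write here: the underlying container (std::deque by default) manages its own memory.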

Custom allocator performance - c++

It's hard to be certain with so little code to study, but I'm betting on locality of reference. Your bitmap with metadata is not on the same cache line as the allocated memory itself. And get_first_unset might be a linear search.
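One way to sidestep both suspected costs is an intrusive free list instead of a bitmap: each freed slot stores the pointer to the next free slot inside its own storage, so there is no separate metadata area and no search. This is a different technique from the one in the question, sketched under the assumption of a C++11 compiler. FixedPool and its members are hypothetical names, and whether it is actually faster would have to be measured.

```cpp
#include <cstddef>
#include <new>
#include <utility>

// Hypothetical fixed-capacity pool using an intrusive free list.
template <typename T>
class FixedPool {
    // A slot either holds a live T or, while free, a link to the next free slot.
    union Slot {
        Slot* nextFree;
        alignas(T) unsigned char storage[sizeof(T)];
    };

    Slot* slots_;
    Slot* freeHead_ = nullptr;     // most recently freed slot, if any
    std::size_t capacity_;
    std::size_t nextUnused_ = 0;   // slots past this index were never handed out

public:
    explicit FixedPool(std::size_t capacity)
        : slots_(new Slot[capacity]), capacity_(capacity) {}

    // Assumes every object created from the pool has already been destroyed.
    ~FixedPool() { delete[] slots_; }

    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // Constructs a T in a free slot, or returns nullptr when the pool is full.
    // Sketch only: if T's constructor throws, the slot is lost.
    template <typename... Args>
    T* create(Args&&... args) {
        Slot* slot;
        if (freeHead_) {                       // reuse the last freed slot
            slot = freeHead_;
            freeHead_ = slot->nextFree;
        } else if (nextUnused_ < capacity_) {  // otherwise take a fresh slot
            slot = &slots_[nextUnused_++];
        } else {
            return nullptr;
        }
        return new (slot->storage) T(std::forward<Args>(args)...);
    }

    // Destroys the object and pushes its slot onto the free list.
    void destroy(T* obj) {
        obj->~T();
        Slot* slot = reinterpret_cast<Slot*>(obj);  // storage sits at offset 0
        slot->nextFree = freeHead_;
        freeHead_ = slot;
    }
};
```

Both create() and destroy() touch only the slot itself and the free-list head, which keeps the bookkeeping on the same cache line as the memory being handed out. The most recently freed slot is reused first, so it is also the one most likely to still be in cache.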
