How to save register values using inline assembler - C++

(This is the first time I've asked a question here and English isn't my first language, so please forgive my mistakes. I'm also new to programming.)
I ran into this problem while doing my OS homework: we were asked to simulate the function SwitchToFiber, and my current problem is that I don't know how to save the register values so that the function can be resumed the next time it is called.
I'm not sure my problem is clear. I don't think my code is very useful yet, but I'll put it below.
#include <stdio.h>
#define INVALID_THD NULL
#define N 5
#define REG_NUM 32
unsigned store[N][REG_NUM];
typedef struct
{
void (*address) (void * arg);
void* argu;
}thread_s;
thread_s ts[N];
void StartThds();
void YieldThd();
void *CreateThd(void (*ThdFunc)(void*), void * arg);
void thd1(void * arg);
void thd2(void * arg);
void StartThds()
{
}
void YieldThd()
{
thd2((void*)2);
}
void *CreateThd(void (*ThdFunc)(void*), void * arg)
{
ts[(int)arg].address = ThdFunc;
ts[(int)arg].argu = arg;
return &ts[(int)arg]; /* anything non-NULL signals success */
}
void thd2(void * arg)
{
for (int i = 4; i < 12; i++)
{
printf("\tthd2: arg=%d , i = %d\n", (int)arg, i);
// the \t above makes the two threads' output easier to tell apart
YieldThd();
}
}
void thd1(void * arg)
{
/*
__asm__(
);
*/
for (int i = 0; i < 12; i++)
{
printf("thd1: arg=%d , i = %d\n", (int)arg, i);
YieldThd();
}
}
int main()
{
// first plan: store the register values in a static array
for(int i = 0; i<N; i++)
for(int j = 0; j<REG_NUM; j++)
store[i][j] = 0;
// create the two threads
if (CreateThd(thd1, (void *)1) == INVALID_THD)
{
printf("cannot create\n");
}
if (CreateThd(thd2, (void *)2) == INVALID_THD)
{
printf("cannot create\n");
}
ts[1].address(ts[1].argu); //thd1((void*)1),argu = 1;
// StartThds();
return 0;
}
That is all the code I have so far; since I don't know which part may be useful, I've included all of it. As you can see, most of the functions are still empty.

It's possible (as pointed out in the comments) that you don't need to write assembly for this; you may be able to get away with just using setjmp()/longjmp() and let them do the necessary state saving.
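If the assignment allows it, another way to avoid hand-written assembly is the POSIX <ucontext.h> API, which saves and restores the register context for you. This is only a minimal sketch to show the mechanism (the names fiber_main and fiber_stack are made up for illustration, and this is not the inline-asm solution the homework asks for):
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, fiber_ctx;
static char fiber_stack[64 * 1024];          /* private stack for the fiber */

static void fiber_main(void)
{
    for (int i = 0; i < 3; i++) {
        printf("\tfiber: i = %d\n", i);
        swapcontext(&fiber_ctx, &main_ctx);  /* yield: save fiber regs, resume main */
    }
}

int main(void)
{
    getcontext(&fiber_ctx);                  /* initialise the context structure */
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof fiber_stack;
    fiber_ctx.uc_link = &main_ctx;           /* where to go when fiber_main returns */
    makecontext(&fiber_ctx, fiber_main, 0);

    for (int i = 0; i < 3; i++) {
        printf("main: resuming fiber (round %d)\n", i);
        swapcontext(&main_ctx, &fiber_ctx);  /* save main regs, run the fiber */
    }
    return 0;
}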

I have done this before, but I always have to look up the details. The following is of course just pseudo-code.
Basically, you create a struct holding the registers you need to preserve:
typedef struct regs {
int ebx; //make sure these have the right size for the processors.
int ecx;
//... for all registers you wish to backup
} registers;
// when switching away from one thread, save its registers
asm( // assembly syntax varies from compiler to compiler - check your manual
"mov ebx, thread1.register.ebx;
mov ecx, thread1.register.ecx;"
// and so on for every register you want to back up
// very important: store the current program counter to the return address of this function so we can continue from there
// (you must know where the return address is stored)
"mov return address, thread1.register.ret"
);
// then restore the other thread's registers
asm(
"mov thread2.register.ebx, ebx;
mov thread2.register.ecx, ecx;
// now restore the pc and let it run
mov thread2.register.ret, pc;" // this will continue from where we stopped before
);
This is more or less the principle of how it works. Since you are learning this you should be able to figure out the rest on your own.
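To make the pseudo-code a little more concrete, here is a minimal sketch in GCC extended-asm syntax (the saved_regs struct and save_some_regs function are hypothetical names). It only stores a few x86-64 callee-saved registers; a real context switch would also have to save the stack pointer and the resume address, which this fragment deliberately leaves out:
struct saved_regs {
    unsigned long rbx, rbp, r12;
};

static void save_some_regs(struct saved_regs *r)
{
    __asm__ volatile(
        "movq %%rbx, %0\n\t"          /* store rbx into r->rbx */
        "movq %%rbp, %1\n\t"
        "movq %%r12, %2\n\t"
        : "=m"(r->rbx), "=m"(r->rbp), "=m"(r->r12)   /* memory output operands */
        :
        : "memory");
}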

Related

Modifying integer value from another process using process_vm_readv

I am using Ubuntu Linux to write two programs. I am attempting to change the value of an integer from another process. My first process (A) is a simple program that loops forever and displays the value to the screen. This program works as intended and simply displays the value -1430532899 (0xAABBCCDD) to the screen.
#include <stdio.h>
int main()
{
//The needle that I am looking for to change from another process
int x = 0xAABBCCDD;
//Loop forever printing out the value of x
int counter = 0;
while(1==1)
{
while(counter<100000000)
{
counter++;
}
counter = 0;
printf("%d",x);
fflush(stdout);
}
return 0;
}
In a separate terminal, I use the ps -e command to list the processes and note the process ID for process (A). Next, as root (using sudo), I run this second program (B) and enter the process ID that I noted from process (A).
The program searches for the needle, which is stored in memory in little-endian order (DD CC BB AA), and takes note of its address. It then tries to write the hex value 0xEEEEEEEE to that same location, but I get a bad-address error (errno is set to 14). The strange thing is that a little later in the address space I am able to write successfully to address 0x601000, yet I cannot write to 0x6005DF where the needle (0xAABBCCDD) lives, even though I can obviously read it, since that is where I found it.
#include <stdio.h>
#include <iostream>
#include <sys/uio.h>
#include <string>
#include <errno.h>
#include <vector>
using namespace std;
char getHex(char value);
string printHex(unsigned char* buffer, int length);
int getProcessId();
int main()
{
//Get the process ID of the process we want to read and write
int pid = getProcessId();
//Lists of addresses where we find our needle 0xAABBCCDD and the addresses where we simply cannot read
vector<long> needleAddresses;
vector<long> unableToReadAddresses;
unsigned char buf1[1000]; //buffer used to store memory values read from other process
//Number of bytes read, also is -1 if an error has occurred
ssize_t nread;
//Structures used in the process_vm_readv system call
struct iovec local[1];
struct iovec remote[1];
local[0].iov_base = buf1;
local[0].iov_len = 1000;
remote[0].iov_base = (void * ) 0x00000; //start at address 0 and work up
remote[0].iov_len = 1000;
for(int i=0;i<10000;i++)
{
nread = process_vm_readv(pid, local, 1, remote, 1 ,0);
if(nread == -1)
{
//errno is 14 then the problem is "bad address"
if(errno == 14)
unableToReadAddresses.push_back((long)remote[0].iov_base);
}
else
{
cout<<printHex(buf1,local[0].iov_len);
for(int j=0;j<1000-3;j++)
{
if(buf1[j] == 0xDD && buf1[j+1] == 0xCC && buf1[j+2] == 0xBB && buf1[j+3] == 0xAA)
{
needleAddresses.push_back((long)(remote[0].iov_base+j));
}
}
}
remote[0].iov_base += 1000;
}
cout<<"Addresses found at...";
for(int i=0;i<needleAddresses.size();i++)
{
cout<<needleAddresses[i]<<endl;
}
//How many bytes written
int nwrite = 0;
struct iovec local2[1];
struct iovec remote2[1];
unsigned char data[] = {0xEE,0xEE,0xEE,0xEE};
local2[0].iov_base = data;
local2[0].iov_len = 4;
remote2[0].iov_base = (void*)0x601000;
remote2[0].iov_len = 4;
for(int i=0;i<needleAddresses.size();i++)
{
cout<<"Attempting to write "<<printHex(data,4)<<" to address "<<needleAddresses[i]<<endl;
remote2[0].iov_base = (void*)needleAddresses[i];
nwrite = process_vm_writev(pid,local2,1,remote2,1,0);
if(nwrite == -1)
{
cout<<"Error writing to "<<needleAddresses[i]<<endl;
}
else
{
cout<<"Successfully wrote data";
}
}
//For some reason THIS will work
remote2[0].iov_base = (void*)0x601000;
nwrite = process_vm_writev(pid,local2,1,remote2,1,0);
cout<<"Wrote "<<nwrite<<" Bytes to the address "<<0x601000 <<" "<<errno;
return 0;
}
string printHex(unsigned char* buffer, int length)
{
string retval;
char temp;
for(int i=0;i<length;i++)
{
temp = buffer[i];
temp = temp>>4;
temp = temp & 0x0F;
retval += getHex(temp);
temp = buffer[i];
temp = temp & 0x0F;
retval += getHex(temp);
retval += ' ';
}
return retval;
}
char getHex(char value)
{
if(value < 10)
{
return value+'0';
}
else
{
value = value - 10;
return value+'A';
}
}
int getProcessId()
{
int data = 0;
printf("Please enter the process id...");
scanf("%d",&data);
return data;
}
Bottom line is that I cannot modify the repeating printed integer from another process.
I can see at least these problems.
No one guarantees there's 0xAABBCCDD anywhere in the writable memory of the process. The compiler can optimize it away entirely, or put it in a register. One way to ensure a variable will be placed in memory is to declare it volatile.
volatile int x = 0xAABBCCDD;
No one guarantees there's no 0xAABBCCDD somewhere in the read-only memory of the process. On the contrary, one could be quite certain there is in fact such a value there. Where else could the program possibly obtain it to initialise the variable? The initialisation probably translates to an assembly instruction similar to this
mov eax, 0xAABBCCDD
which, unsurprisingly, contains a bit pattern that matches 0xAABBCCDD. The address 0x6005DF could well be in the .text section. It is extremely unlikely it is on the stack, because stack addresses are typically close to the top of the address space.
The address space of a 64-bit process is huge. There is no hope to traverse it all in a reasonable amount of time. One needs to limit the range of addresses somehow.
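On Linux, one practical way to limit the range is to scan only the mappings listed in /proc/<pid>/maps, keeping just the writable ones. A rough sketch (the Region struct and writableRegions function are illustrative; the sscanf pattern assumes the usual "start-end perms ..." line layout):
#include <cstdio>
#include <vector>

struct Region { unsigned long start, end; };

std::vector<Region> writableRegions(int pid)
{
    std::vector<Region> out;
    char path[64];
    std::snprintf(path, sizeof path, "/proc/%d/maps", pid);
    FILE* f = std::fopen(path, "r");
    if (!f) return out;
    char line[512];
    while (std::fgets(line, sizeof line, f)) {
        unsigned long start, end;
        char perms[8];
        // e.g. "00601000-00622000 rw-p 00000000 00:00 0   [heap]"
        if (std::sscanf(line, "%lx-%lx %7s", &start, &end, perms) == 3 && perms[1] == 'w')
            out.push_back({start, end});
    }
    std::fclose(f);
    return out;
}
The returned ranges can then be fed to process_vm_readv() instead of starting at address 0 and walking blindly.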

Are compiler optimizations solving thread safety issues?

I'm writing multi-threaded C++ code. While testing the overhead of different mutex locks, I found that the thread-unsafe code seems to yield the correct result when compiled with the Release configuration in Visual Studio, while being much faster than the code with a mutex lock. With the Debug configuration, however, the result is what I expected. Is it the compiler that solved this, or is it just that the Release build runs so fast that the two threads never access the memory at the same time?
My test code is below.
class Mutex {
public:
unsigned long long _data;
bool tryLock() {
return mtx.try_lock();
}
inline void Lock() {
mtx.lock();
}
inline void Unlock() {
mtx.unlock();
}
void safeSet(const unsigned long long &data) {
Lock();
_data = data;
Unlock();
}
Mutex& operator++ () {
Lock();
_data++;
Unlock();
return (*this);
}
Mutex operator++(int) {
Mutex tmp = (*this);
Lock();
_data++;
Unlock();
return tmp;
}
Mutex() {
_data = 0;
}
private:
std::mutex mtx;
Mutex(Mutex& cpy) {
_data = cpy._data;
}
}val;
static DWORD64 val_unsafe = 0;
DWORD WINAPI safeThreads(LPVOID lParam) {
for (int i = 0; i < 655360;i++) {
++val;
}
return 0;
}
DWORD WINAPI unsafeThreads(LPVOID lParam) {
for (int i = 0; i < 655360; i++) {
val_unsafe++;
}
return 0;
}
int main()
{
val._data = 0;
vector<HANDLE> hThreads;
LARGE_INTEGER freq, time1, time2;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&time1);
for (int i = 0; i < 32; i++) {
hThreads.push_back( CreateThread(0, 0, safeThreads, 0, 0, 0));
}
for each(HANDLE handle in hThreads)
{
WaitForSingleObject(handle, INFINITE);
}
QueryPerformanceCounter(&time2);
cout<<time2.QuadPart - time1.QuadPart<<endl;
hThreads.clear();
QueryPerformanceCounter(&time1);
for (int i = 0; i < 32; i++) {
hThreads.push_back(CreateThread(0, 0, unsafeThreads, 0, 0, 0));
}
for each(HANDLE handle in hThreads)
{
WaitForSingleObject(handle, INFINITE);
}
QueryPerformanceCounter(&time2);
cout << time2.QuadPart - time1.QuadPart << endl;
hThreads.clear();
cout << val._data << endl << val_unsafe<<endl;
cout << freq.QuadPart << endl;
return 0;
}
The standard doesn't let you assume that code is thread-safe by default. Nevertheless, your code gives the correct result when compiled in release mode for x64.
But why?
If you look at the assembler file generated for your code, you'll find that the optimizer simply unrolled the loop and applied constant propagation. So instead of looping 655360 times, it just adds a constant to your counter:
?unsafeThreads@@YAKPEAX@Z PROC ; unsafeThreads, COMDAT
; 74 : for (int i = 0; i < 655360; i++) {
add QWORD PTR ?val_unsafe@@3_KA, 655360 ; 000a0000H <======= HERE
; 75 : val_unsafe++;
; 76 : }
; 77 : return 0;
xor eax, eax
; 78 : }
In this situation, with a single and very fast instruction in each thread, a data race is much less probable: most likely one thread has already finished before the next is launched.
How to see the expected result from your benchmark?
If you want to keep the optimizer from collapsing your test loops, declare _data and val_unsafe as volatile. You'll then notice that the unsafe value is no longer correct due to data races. Running my own tests with this modified code, I get the correct value for the safe version and always different (and wrong) values for the unsafe version. For example:
safe time:5672583
unsafe time:145092 // <=== much faster
val:20971520
val_unsafe:3874844 // <=== OUCH !!!!
freq: 2597654
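For reference, the only change behind those numbers is the pair of volatile qualifiers (a sketch against the question's code; everything else stays the same):
volatile unsigned long long _data;        // inside class Mutex
static volatile DWORD64 val_unsafe = 0;   // the free-standing unsafe counter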
Want to make your unsafe code safe?
If you want to make your unsafe code safe without using an explicit mutex, you could just make val_unsafe a std::atomic. The result will be platform dependent (the implementation could very well introduce a mutex for you), but on the same machine as above, with MSVC15 in release mode, I get:
safe time:5616282
unsafe time:798851 // still much faster (6 to 7 times in average)
val:20971520
val_unsafe:20971520 // but always correct
freq: 2597654
The only thing you then still have to do: rename the atomic version of the variable from val_unsafe to also_safe_val ;-)
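For illustration, here is roughly what that atomic variant looks like (a sketch keeping the question's thread-function shape; the name also_safe_val and the relaxed memory order are my choices, not code from the question):
#include <atomic>
#include <Windows.h>

std::atomic<unsigned long long> also_safe_val{0};

DWORD WINAPI atomicThreads(LPVOID lParam)
{
    for (int i = 0; i < 655360; i++)
        also_safe_val.fetch_add(1, std::memory_order_relaxed);  // lock-free increment
    return 0;
}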

Custom allocator performance

I'm building an AVL tree class which will have a fixed maximum number of items. So I thought instead of allocating each item by itself, I'd just allocate the entire chunk at once and use a bitmap to assign new memory when needed.
My allocation / deallocation code:
avltree::avltree(UINT64 numitems)
{
root = NULL;
if (!numitems)
buffer = NULL;
else {
UINT64 memsize = sizeof(avlnode) * numitems + bitlist::storagesize(numitems);
buffer = (avlnode *) malloc(memsize);
memmap.init(numitems, buffer + numitems);
memmap.clear_all();
freeaddr = 0;
}
}
avlnode *avltree::newnode(keytype key)
{
if (!buffer)
return new avlnode(key);
else
{
UINT64 pos;
if (freeaddr < memmap.size_bits)
pos = freeaddr++;
else
pos = memmap.get_first_unset();
memmap.set_bit(pos);
return new (&buffer[pos]) avlnode(key);
}
}
void avltree::deletenode(avlnode *node)
{
if (!buffer)
delete node;
else
memmap.clear_bit(node - buffer);
}
In order to use standard new / delete, I have to construct the tree with numitems == 0. In order to use my own allocator, I just pass number of items. All functions are inlined for maximum performance.
This is all fine and dandy, but my own allocator is about 20% slower than new / delete. Now, I know how complex memory allocators are; there's no way that code can run faster than an array lookup plus one bit set, yet that is exactly what is happening here. What's worse, my deallocator is slower even if I remove all the code from it?!
When I check assembly output, my allocator's code path is ridden with QWORD PTR instructions dealing with bitmap, avltree or avlnode. It doesn't seem to be much different for the new / delete path.
For example, assembly output of avltree::newnode:
;avltree::newnode, COMDAT
mov QWORD PTR [rsp+8], rbx
push rdi
sub rsp, 32
;if (!buffer)
cmp QWORD PTR [rcx+8], 0
mov edi, edx
mov rbx, rcx
jne SHORT $LN4@newnode
; return new avlnode(key);
mov ecx, 24
call ??2@YAPEAX_K@Z ; operator new
jmp SHORT $LN27@newnode
;$LN4@newnode:
;else {
; UINT64 pos;
; if (freeaddr < memmap.size_bits)
mov r9, QWORD PTR [rcx+40]
cmp r9, QWORD PTR [rcx+32]
jae SHORT $LN2@newnode
; pos = freeaddr++;
lea rax, QWORD PTR [r9+1]
mov QWORD PTR [rcx+40], rax
; else
jmp SHORT $LN1@newnode
$LN2@newnode:
; pos = memmap.get_first_unset();
add rcx, 16
call ?get_first_unset@bitlist@@QEAA_KXZ ; bitlist::get_first_unset
mov r9, rax
$LN1@newnode:
; memmap.set_bit(pos);
mov rcx, QWORD PTR [rbx+16] ;data[bindex(pos)] |= bmask(pos);
mov rdx, r9 ;return pos / (sizeof(BITINT) * 8);
shr rdx, 6
lea r8, QWORD PTR [rcx+rdx*8] ;data[bindex(pos)] |= bmask(pos);
movzx ecx, r9b ;return 1ull << (pos % (sizeof(BITINT) * 8));
mov edx, 1
and cl, 63
shl rdx, cl
; return new (&buffer[pos]) avlnode(key);
lea rcx, QWORD PTR [r9+r9*2]
; File c:\projects\vvd\vvd\util\bitlist.h
or QWORD PTR [r8], rdx ;data[bindex(pos)] |= bmask(pos)
; 195 : return new (&buffer[pos]) avlnode(key);
mov rax, QWORD PTR [rbx+8]
lea rax, QWORD PTR [rax+rcx*8]
; $LN27@newnode:
test rax, rax
je SHORT $LN9@newnode
; avlnode constructor;
mov BYTE PTR [rax+4], 1
mov QWORD PTR [rax+8], 0
mov QWORD PTR [rax+16], 0
mov DWORD PTR [rax], edi
; 196 : }
; 197 : }
; $LN9@newnode:
mov rbx, QWORD PTR [rsp+48]
add rsp, 32 ; 00000020H
pop rdi
ret 0
?newnode@avltree@@QEAAPEAUavlnode@@H@Z ENDP ; avltree::newnode
_TEXT ENDS
I have checked the compiler output multiple times, constructing my avltree with the default and with the custom allocator, and it remains the same in this particular region of code. I have tried removing / replacing all relevant parts to no significant effect.
To be honest, I expected the compiler to inline all of this since there are very few variables. I was hoping for everything except the avlnode objects themselves to be placed in registers, but that doesn't seem to be the case.
Yet the speed difference is clearly measurable. I'm not calling 3 seconds per 10 million nodes inserted slow, but I expected my code to be faster, not slower than the generic allocator (2.5 seconds). That goes especially for the deallocator, which is slower even when all code is stripped from it.
Why is it slower?
Edit:
Thank you all for the excellent thoughts on this. But I would like to stress again that the issue is not so much my method of allocation as it is the suboptimal way the variables are used: the entire avltree class contains just 4 UINT64 variables, and bitlist only has 3.
However, despite that, the compiler doesn't optimise this into registers. It insists on QWORD PTR instructions that are orders of magnitude slower. Is this because I'm using classes? Should I move to C / plain variables? Scratch that. Stupid me. I have all the avltree code in there as well, things can't be in registers.
Also, I am at a total loss why my deallocator would still be slower, even if I remove ALL code from it. Yet QueryPerformanceCounter tells me just that. It's insane to even think that: that same deallocator also gets called for the new / delete code path and it has to delete the node... It doesn't have to do anything for my custom allocator (when I strip the code).
Edit2:
I have now completely removed the bitlist and implemented free space tracking through a singly-linked list. The avltree::newnode function is now much more compact (21 instructions for my custom allocator path, 7 of those are QWORD PTR ops dealing with avltree and 4 are used for constructor of avlnode).
The end result (time) decreased from ~3 seconds to ~2.95 seconds for 10 million allocations.
Edit3:
I also rewrote the entire code such that now everything is handled by the singly linked list. Now the avltree class only has two relevant members: root and first_free. The speed difference remains.
Edit4:
Rearranging code and looking at performance figures, these things are what helped the most:
As pointed out by all contributors, having a bitmap in there was just plain bad. Removed in favour of a singly-linked free-slot list.
Code locality: moving dependent functions (the AVL-tree handling ones) into a function-local class instead of declaring them globally improved speed by some 15% (3 secs --> 2.5 secs)
avlnode struct size: just adding #pragma pack(1) before the struct declaration decreased execution time by a further 20% (2.5 secs --> 2 secs)
Edit 5:
Since this question seems to have been quite popular, I have posted the final complete code as an answer below. I'm quite satisfied with its performance.
Your method only allocates the raw memory in one chunk and then has to do a placement new for each element. Combine that with all the overhead in your bitmap and it's not too surprising that the default new allocation beats yours, assuming an empty heap.
To get the most speed improvement when allocating you can allocate the entire object in one large array and then assign to it from there. If you look at a very simple and contrived benchmark:
struct test_t {
float f;
int i;
test_t* pNext;
};
const size_t NUM_ALLOCS = 50000000;
void TestNew (void)
{
test_t* pPtr = new test_t;
for (int i = 0; i < NUM_ALLOCS; ++i)
{
pPtr->pNext = new test_t;
pPtr = pPtr->pNext;
}
}
void TestBucket (void)
{
test_t* pBuckets = new test_t[NUM_ALLOCS + 2];
test_t* pPtr = pBuckets++;
for (int i = 0; i < NUM_ALLOCS; ++i)
{
pPtr->pNext = pBuckets++;
pPtr = pPtr->pNext;
}
}
With this code on MSVC++ 2013 with 50M allocations TestBucket() outperforms TestNew() by over a factor of x16 (130 vs 2080 ms). Even if you add a std::bitset<> to track allocations it is still x4 faster (400 ms).
An important thing to remember about new is that the time it takes to allocate an object generally depends on the state of the heap. An empty heap will be able to allocate a bunch of constant-sized objects like this relatively fast, which is probably one reason why your code seems slower than new. If you have a program that runs for a while and allocates a large number of differently sized objects, the heap can become fragmented and allocating objects can take much (much) longer.
As an example, one program I wrote was loading a 200MB file with millions of records... lots of differently sized allocations. On the first load it took ~15 seconds, but if I deleted that data and tried to load the file again it took something like x10-x20 longer. This was entirely due to memory allocation, and switching to a simple bucket/arena allocator fixed the issue. So, that contrived benchmark I did showing a x16 speedup might actually show a significantly larger difference with a fragmented heap.
It gets even more tricky when you realize that different systems/platforms may use different memory allocation schemes so the benchmark results on one system may be different from another.
To distil this into a few short points:
Benchmarking memory allocation is tricky (performance depends on a lot of things)
In some cases you can get better performance with a custom allocator. In a few cases you can get much better.
Creating a custom allocator can be tricky and takes time to profile/benchmark your specific use case.
Note -- Benchmarks like this aren't meant to be realistic but are useful to determine the upper bound of how fast something can be. It can be used along with the profile/benchmark of your actual code to help determine what should/shouldn't be optimized.
Update -- I can't seem to replicate your results in my code under MSVC++ 2013. Using the same structure as your avlnode and trying a placement new yields the same speed as my non-placement bucket allocator tests (placement new was actually a little bit faster). Using a class similar to your avltree doesn't affect the benchmark much. With 10 million allocations/deallocations I'm getting ~800 ms for new/delete and ~200 ms for the custom allocator (both with and without placement new). While I'm not worried about the difference in absolute times, the relative time difference seems odd.
I would suggest taking a closer look at your benchmark and make sure you are measuring what you think you are. If the code exists in a larger code-base then create a minimal test case to benchmark it. Make sure that your compiler optimizer is not doing something that would invalidate the benchmark (it happens too easily these days).
Note that it would be far easier to answer your question if you had reduced it to a minimal example and included the complete code in the question, including the benchmark code. Benchmarking is one of those things that seems easy but there are a lot of "gotchas" involved in it.
Update 2 -- Including the basic allocator class and benchmark code I'm using so others can try to duplicate my results. Note that this is for testing only and is far from actual working/production code. It is far simpler than your code which may be why we're getting different results.
#include <cstdio>
#include <string>
#include <Windows.h>
struct test_t
{
__int64 key;
__int64 weight;
__int64 left;
__int64 right;
test_t* pNext; // Simple linked list
test_t() : key(0), weight(0), pNext(NULL), left(0), right(0) { }
test_t(const __int64 k) : key(k), weight(0), pNext(NULL), left(0), right(0) { }
};
const size_t NUM_ALLOCS = 10000000;
test_t* pLast; //To prevent compiler optimizations from being "smart"
struct CTest
{
test_t* m_pBuffer;
size_t m_MaxSize;
size_t m_FreeIndex;
test_t* m_pFreeList;
CTest(const size_t Size) :
m_pBuffer(NULL),
m_MaxSize(Size),
m_pFreeList(NULL),
m_FreeIndex(0)
{
if (m_MaxSize > 0) m_pBuffer = (test_t *) new char[sizeof(test_t) * (m_MaxSize + 1)];
}
test_t* NewNode(__int64 key)
{
if (!m_pBuffer || m_FreeIndex >= m_MaxSize) return new test_t(key);
size_t Pos = m_FreeIndex;
++m_FreeIndex;
return new (&m_pBuffer[Pos]) test_t(key);
}
void DeleteNode(test_t* pNode)
{
if (!m_pBuffer) {
delete pNode;
}
else
{
pNode->pNext = m_pFreeList;
m_pFreeList = pNode;
}
}
};
void TestNew(void)
{
test_t* pPtr = new test_t;
test_t* pFirst = pPtr;
for (int i = 0; i < NUM_ALLOCS; ++i)
{
pPtr->pNext = new test_t;
pPtr = pPtr->pNext;
}
pPtr = pFirst;
while (pPtr)
{
test_t* pTemp = pPtr;
pPtr = pPtr->pNext;
delete pTemp;
}
pLast = pPtr;
}
void TestClass(const size_t BufferSize)
{
CTest Alloc(BufferSize);
test_t* pPtr = Alloc.NewNode(0);
test_t* pFirstPtr = pPtr;
for (int i = 0; i < NUM_ALLOCS; ++i)
{
pPtr->pNext = Alloc.NewNode(i);
pPtr = pPtr->pNext;
}
pLast = pPtr;
pPtr = pFirstPtr;
while (pPtr != NULL)
{
test_t* pTmp = pPtr->pNext;
Alloc.DeleteNode(pPtr);
pPtr = pTmp;
}
}
int main(void)
{
DWORD StartTick = GetTickCount();
TestClass(0);
//TestClass(NUM_ALLOCS + 10);
//TestNew();
DWORD EndTick = GetTickCount();
printf("Time = %u ms\n", EndTick - StartTick);
printf("Last = %p\n", pLast);
return 0;
}
Currently I'm getting ~800ms for both TestNew() and TestClass(0) and under 200ms for TestClass(NUM_ALLOCS + 10). The custom allocator is pretty fast as it operates on the memory in a completely linear fashion which allows the memory cache to work its magic. I'm also using GetTickCount() for simplicity and it is accurate enough so long as times are above ~100ms.
It's hard to be certain with so little code to study, but I'm betting on locality of reference. Your bitmap with metadata is not on the same cache line as the allocated memory itself. And get_first_unset might be a linear search.
Now, I know how complex memory allocators are, there's no way that code can run faster than an array lookup + one bit set, but that is exactly the case here.
This isn't even nearly correct. A decent bucketing low-fragmentation heap is O(1) with very low constant time (and effectively zero additional space overhead). I've seen a version that came down to ~18 asm instructions (with one branch) before. That's a lot less than your code. Remember, heaps may be massively complex in total, but the fast path through them may be really, really fast.
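To illustrate how short such a fast path can be, here is a sketch of a bare free-list allocator (illustrative only; a real heap adds a slow path, bucketing by size, and thread safety):
struct FreeNode { FreeNode* next; };

static FreeNode* free_head = nullptr;

void* fast_alloc()
{
    FreeNode* n = free_head;
    if (!n) return nullptr;        // slow path (refill from a fresh chunk) omitted
    free_head = n->next;           // pop: one load, one compare, one store
    return n;
}

void fast_free(void* p)
{
    FreeNode* n = static_cast<FreeNode*>(p);
    n->next = free_head;           // push: two stores
    free_head = n;
}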
Just for reference, the following code was the most performant for the problem at hand.
It's just a simple avltree implementation, but it does reach 1.7 secs for 10 million inserts and 1.4 secs for an equal number of deletes on my 2600K @ 4.6 GHz.
#include "stdafx.h"
#include <iostream>
#include <crtdbg.h>
#include <Windows.h>
#include <malloc.h>
#include <new>
#ifndef NULL
#define NULL 0
#endif
typedef int keytype;
typedef unsigned long long UINT64;
struct avlnode;
struct avltree
{
avlnode *root;
avlnode *buffer;
avlnode *firstfree;
avltree() : avltree(0) {};
avltree(UINT64 numitems);
inline avlnode *newnode(keytype key);
inline void deletenode(avlnode *node);
void insert(keytype key) { root = insert(root, key); }
void remove(keytype key) { root = remove(root, key); }
int height();
bool hasitems() { return root != NULL; }
private:
avlnode *insert(avlnode *node, keytype k);
avlnode *remove(avlnode *node, keytype k);
};
#pragma pack(1)
struct avlnode
{
avlnode *left; //left pointer
avlnode *right; //right pointer
keytype key; //node key
unsigned char hgt; //height of the node
avlnode(int k)
{
key = k;
left = right = NULL;
hgt = 1;
}
avlnode &balance()
{
struct F
{
unsigned char height(avlnode &node)
{
return &node ? node.hgt : 0;
}
int balance(avlnode &node)
{
return &node ? height(*node.right) - height(*node.left) : 0;
}
int fixheight(avlnode &node)
{
unsigned char hl = height(*node.left);
unsigned char hr = height(*node.right);
node.hgt = (hl > hr ? hl : hr) + 1;
return (&node) ? hr - hl : 0;
}
avlnode &rotateleft(avlnode &node)
{
avlnode &p = *node.right;
node.right = p.left;
p.left = &node;
fixheight(node);
fixheight(p);
return p;
}
avlnode &rotateright(avlnode &node)
{
avlnode &q = *node.left;
node.left = q.right;
q.right = &node;
fixheight(node);
fixheight(q);
return q;
}
avlnode &b(avlnode &node)
{
int bal = fixheight(node);
if (bal == 2) {
if (balance(*node.right) < 0)
node.right = &rotateright(*node.right);
return rotateleft(node);
}
if (bal == -2) {
if (balance(*node.left) > 0)
node.left = &rotateleft(*node.left);
return rotateright(node);
}
return node; // balancing is not required
}
} f;
return f.b(*this);
}
};
avltree::avltree(UINT64 numitems)
{
root = buffer = firstfree = NULL;
if (numitems) {
buffer = (avlnode *) malloc(sizeof(avlnode) * (numitems + 1));
avlnode *tmp = &buffer[numitems];
while (tmp > buffer) {
tmp->right = firstfree;
firstfree = tmp--;
}
}
}
avlnode *avltree::newnode(keytype key)
{
avlnode *node = firstfree;
/*
If you want to support dynamic allocation, uncomment this.
It does present a bit of an overhead for bucket allocation though (8% slower)
Also, if a condition is met where bucket is too small, new nodes will be dynamically allocated, but never freed
if (!node)
return new avlnode(key);
*/
firstfree = firstfree->right;
return new (node) avlnode(key);
}
void avltree::deletenode(avlnode *node)
{
/*
If you want to support dynamic allocation, uncomment this.
if (!buffer)
delete node;
else {
*/
node->right = firstfree;
firstfree = node;
}
int avltree::height()
{
return root ? root->hgt : 0;
}
avlnode *avltree::insert(avlnode *node, keytype k)
{
if (!node)
return newnode(k);
if (k == node->key)
return node;
else if (k < node->key)
node->left = insert(node->left, k);
else
node->right = insert(node->right, k);
return &node->balance();
}
avlnode *avltree::remove(avlnode *node, keytype k) // deleting k key from p tree
{
if (!node)
return NULL;
if (k < node->key)
node->left = remove(node->left, k);
else if (k > node->key)
node->right = remove(node->right, k);
else // k == p->key
{
avlnode *l = node->left;
avlnode *r = node->right;
deletenode(node);
if (!r) return l;
struct F
{
//findmin finds the minimum node
avlnode &findmin(avlnode *node)
{
return node->left ? findmin(node->left) : *node;
}
//removemin removes the minimum node
avlnode &removemin(avlnode &node)
{
if (!node.left)
return *node.right;
node.left = &removemin(*node.left);
return node.balance();
}
} f;
avlnode &min = f.findmin(r);
min.right = &f.removemin(*r);
min.left = l;
return &min.balance();
}
return &node->balance();
}
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
// 64 bit release performance (for 10.000.000 nodes)
// malloc: insertion: 2,595 deletion 1,865
// my allocator: insertion: 2,980 deletion 2,270
const int nodescount = 10000000;
avltree &tree = avltree(nodescount);
cout << "sizeof avlnode " << sizeof(avlnode) << endl;
cout << "inserting " << nodescount << " nodes" << endl;
LARGE_INTEGER t1, t2, freq;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t1);
for (int i = 1; i <= nodescount; i++)
tree.insert(i);
QueryPerformanceCounter(&t2);
cout << "Tree height " << (int) tree.height() << endl;
cout << "Insertion time: " << ((double) t2.QuadPart - t1.QuadPart) / freq.QuadPart << " s" << endl;
QueryPerformanceCounter(&t1);
while (tree.hasitems())
tree.remove(tree.root->key);
QueryPerformanceCounter(&t2);
cout << "Deletion time: " << ((double) t2.QuadPart - t1.QuadPart) / freq.QuadPart << " s" << endl;
#ifdef _DEBUG
_CrtMemState mem;
_CrtMemCheckpoint(&mem);
cout << "Memory used: " << mem.lTotalCount << " high: " << mem.lHighWaterCount << endl;
#endif
return 0;
}

Can't return anything other than 1 or 0 from int function

I wish my first post weren't so newbie. I've been working with openFrameworks; so far so good, but as I'm new to programming I'm having a real headache returning the right value from an int function. I would like the int to increment until a Boolean condition is met and then decrement back to zero. The int is used to move through an array from beginning to end and then back. When I put the guts of the function directly into the method where the int is used, everything works perfectly, but it is very messy, and I wonder how computationally expensive it is to leave it there; it just seems that my syntactic abilities are lacking to do otherwise. Advice appreciated, and thanks in advance.
int testApp::updown(int j){
if(j==0){
arp =true;
}
else if (j==7){
arp = false;
}
if(arp == true){
j++;
}
else if(arp == false){
j--;
}
return (j);
}
and then it's called like this in an audioRequested block of the library I'm working with:
for (int i = 0; i < bufferSize; i++){
if ((int)timer.phasor(sorSpeed)) {
z = updown(_j);
noteOut = notes [z];
cout<<arp;
cout<<z;
}
}
EDIT: To add some information: I removed the last condition of the second if statement; it was there because I was experiencing strange behaviour where j would start walking off the end of the array.
Excerpt of testApp.h
int z, _j=0;
Boolean arp;
EDIT 2: I've revised this now and it works. Apologies for asking something so rudimentary, and with such terrible code to go with it. I do appreciate the time people have taken to comment here. Here are my revised .cpp and .h files for your perusal. Thanks again.
#include "testApp.h"
#include <iostream>
using namespace std;
testApp::~testApp() {
}
void testApp::setup(){
sampleRate = 44100;
initialBufferSize = 1024;
//MidiIn.openPort();
//ofAddListener(MidiIn.newMessageEvent, this, &testApp::newMessage);
j = 0;
z= 0;
state = 1;
tuning = 440;
inputNote = 127;
octave = 4;
sorSpeed = 2;
freqOut = (tuning/32) * pow(2,(inputNote-69)/12);
finalOut = freqOut * octave;
notes[7] = finalOut+640;
notes[6] = finalOut+320;
notes[5] = finalOut+160;
notes[4] = finalOut+840;
notes[3] = finalOut+160;
notes[2] = finalOut+500;
notes[1] = finalOut+240;
notes[0] = finalOut;
ofSoundStreamSetup(2,0,this, sampleRate, initialBufferSize, 4);/* Call this last ! */
}
void testApp::update(){
}
void testApp::draw(){
}
int testApp::updown(int &_j){
int tmp;
if(_j==0){
arp = true;
}
else if(_j==7) {
arp = false;
}
if(arp == true){
_j++;
}
else if(arp == false){
_j--;
}
tmp = _j;
return (tmp);
}
void testApp::audioRequested (float * output, int bufferSize, int nChannels){
for (int i = 0; i < bufferSize; i++){
if ((int)timer.phasor(sorSpeed)) {
noteOut = notes [updown(z)];
}
mymix.stereo(mySine.sinewave(noteOut),outputs,0.5);
output[i*nChannels ] = outputs[0];
output[i*nChannels + 1] = outputs[1];
}
}
testApp.h
class testApp : public ofBaseApp{
public:
~testApp();/* destructor is very useful */
void setup();
void update();
void draw();
void keyPressed (int key);
void keyReleased(int key);
void mouseMoved(int x, int y );
void mouseDragged(int x, int y, int button);
void mousePressed(int x, int y, int button);
void mouseReleased(int x, int y, int button);
void windowResized(int w, int h);
void dragEvent(ofDragInfo dragInfo);
void gotMessage(ofMessage msg);
void newMessage(ofxMidiEventArgs &args);
ofxMidiIn MidiIn;
void audioRequested (float * input, int bufferSize, int nChannels); /* output method */
void audioReceived (float * input, int bufferSize, int nChannels); /* input method */
Boolean arp;
int initialBufferSize; /* buffer size */
int sampleRate;
int updown(int &intVar);
/* stick you maximilian stuff below */
double filtered,sample,outputs[2];
maxiFilter filter1;
ofxMaxiMix mymix;
ofxMaxiOsc sine1;
ofxMaxiSample beats,beat;
ofxMaxiOsc mySine,myOtherSine,timer;
int currentCount,lastCount,i,j,z,octave,sorSpeed,state;
double notes[8];
double noteOut,freqOut,tuning,finalOut,inputNote;
};
It's pretty hard to piece this all together. I do think you need to go back to basics a bit, but all the same I think I can explain what is going on.
You initialise _j to 0 and then never modify the value of _j.
You therefore call updown passing 0 as the parameter every time.
updown returns a value of 1 when the input is 0.
Perhaps you meant to pass z to updown when you call it, but I cannot be sure.
Are you really declaring global variables in your header file? That's not good. Try to use local variables and/or parameters as much as possible. Global variables are pretty evil, especially declared in the header file like that!
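To make that concrete, here is a standalone toy (not the poster's openFrameworks code) showing the two usual fixes: either store the returned value back into the variable, or pass it by reference so updown() can modify it, which is what EDIT 2 ends up doing:
#include <iostream>

bool arp = false;

int updown(int& j)          // by reference: the j++ / j-- persist in the caller
{
    if (j == 0) arp = true;
    else if (j == 7) arp = false;
    if (arp) j++; else j--;
    return j;
}

int main()
{
    int z = 0;
    for (int i = 0; i < 20; i++)
        std::cout << updown(z) << ' ';   // walks 1..7 and then back down to 0
    std::cout << '\n';
    return 0;
}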

How to optimize such simple data (event) casting Class?

So I have created a compilable prototype for a graph element that can cast (broadcast) its data to subscribed functions.
//You can compile it with no errors.
#include <iostream>
#include <cstring> // strlen, memcpy
#include <vector>
using namespace std;
class GraphElementPrototype {
// we should define prototype of functions that will be subscribers to our data
typedef void FuncCharPtr ( char *) ;
public:
//function for preparing the class to work
void init()
{
sample = new char[5000];
memset(sample, 'x', 4999); // fill with dummy payload...
sample[4999] = '\0'; // ...and null-terminate so strlen() below is well defined
}
// function for adding subscribers functions
void add (FuncCharPtr* f)
{
FuncVec.push_back (f) ;
} ;
// function for data update
void call()
{
// here would have been useful code for data update
//...
castData(sample);
} ;
//clean up init
void clean()
{
delete[] sample;
sample = 0;
}
private:
//private data object we use in "call" public class function
char* sample;
//Cast data to the subscribers; each one receives its own copy and must delete[] it
void castData(char * data){
size_t len = strlen(data) + 1; // +1 so each copy keeps its null terminator
for (size_t i = 0 ; i < FuncVec.size() ; i++){
char * dataCopy = new char[len];
memcpy (dataCopy, data, len);
FuncVec[i] (dataCopy) ;}
}
// vector to hold subscribed functions
vector<FuncCharPtr*> FuncVec ;
} ;
static void f0 (char * i) { cout << "f0" << endl; delete[] i; i=0; }
static void f1 (char * i) { cout << "f1" << endl; delete[] i; i=0; }
int main() {
GraphElementPrototype a ;
a.init();
a.add (f0) ;
a.add (f1) ;
for (int i = 0; i<50000; i++)
{
a.call() ;
}
a.clean();
cin.get();
}
Is it possible to optimize my data casting system? And if yes how to do it?
Implement the program correctly and safely
If performance Not Acceptable
    While Not Acceptable
        Profile
        Optimize
Done!
In my experience, premature optimization is the devil.
EDIT:
Apparently while I was formatting my answer, another James ninja'd me with a similar answer. Well played.
Is it possible to optimize my data casting system? And if yes how to do it?
If your program is not too slow then there is no need to perform optimizations. If it is too slow, then generally, improving its performance should be done like so:
1. Profile your program
2. Identify the parts of your program that are the most expensive
3. Select from the parts found in step 2 those that are likely to be (relatively) easy to improve
4. Improve those parts of the code via refactoring, rewriting, or other techniques of your choosing
Repeat these steps until your program is no longer too slow. (A minimal sketch of step 1, in its simplest form, follows.)
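In case it helps, the simplest possible form of the "Profile" step is just timing the hot loop; anything beyond that calls for a real profiler. A sketch (doWork() is a stand-in for a.call() from the question):
#include <chrono>
#include <iostream>

void doWork() { /* stand-in for the code under test, e.g. a.call() */ }

int main()
{
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 50000; i++)
        doWork();
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0);
    std::cout << ms.count() << " ms\n";   // crude, but enough to spot a 2x change
    return 0;
}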