I have a problem when using FMOD Ex. I've searched thoroughly over the net to see if anyone has had the same problem, but I didn't find anything related to it.
I made a class that loads and plays my sounds, in this case, streams. Here is my code:
Audio::Audio()
{
    // Create the system object
    m_Result = FMOD::System_Create(&m_pSystem);
    ErrorCheck(m_Result);

    // Check the FMOD version
    m_Result = m_pSystem->getVersion(&m_FmodVersion);
    if (m_FmodVersion < FMOD_VERSION)
        MessageBox(NULL, FMOD_ErrorString(m_Result), "FMOD Version Error", MB_OK);

    // Check if hardware acceleration is disabled
    m_Result = m_pSystem->getDriverCaps(0, &m_Caps, 0, &m_SpeakerMode);
    if (m_Caps & FMOD_CAPS_HARDWARE_EMULATED)
        MessageBox(NULL, FMOD_ErrorString(m_Result), "FMOD Acceleration Error", MB_OK);

    // Initialize the system object
    m_Result = m_pSystem->init(2, FMOD_INIT_NORMAL, 0);
    ErrorCheck(m_Result);

    m_pChannel = 0;
    m_IsLoaded = false;
}
void Audio::LoadMusic(char *filename)
{
    m_Result = m_pSystem->createStream(filename, FMOD_CREATESTREAM, 0, &m_pSound);
    ErrorCheck(m_Result);
}

void Audio::Play()
{
    SetPause(false);
    m_Result = m_pSystem->playSound(FMOD_CHANNEL_FREE, m_pSound, false, &m_pChannel);
    ErrorCheck(m_Result);
    SetPause(true);
}
After this I just do:
pAudio->LoadMusic("test.mp3");
pAudio->Play();
The sound plays with no problem. The problem happens when loading the stream: the memory used by the process keeps increasing and never stops. My guess is that the small buffer used to read the MP3 stream is not being freed, so each load grabs the next available piece of free RAM and the program's memory usage keeps growing.
I thought that maybe calling release() after each play would work, but then I noticed that release() frees ALL the memory in the sound instance.
Could anyone give me some pointers on to what I'm doing wrong here? How do I prevent this?
I'm not sure if I have made it clear enough or not.
Thanks in advance for the help.
Each time you call pAudio->LoadMusic you allocate (and leak) more memory, because you create a new FMOD::Sound instance (which, as you noted, has its own stream buffer). If you simply want to play the sound again, just call pAudio->Play and the stream will restart.
If you are concerned about FMOD's memory usage you can call Memory_GetStats to monitor it, just in case I have misunderstood your usage and something else is causing the leak.
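As a rough sketch of both points (this reuses the member names from the code above; the exact Memory_GetStats signature can vary slightly between FMOD Ex versions), LoadMusic could release any previously created sound before building a new stream, and the totals can be logged to confirm where the growth comes from:

// Sketch only: release the previous FMOD::Sound before creating a new stream,
// so repeated LoadMusic calls don't accumulate sound instances.
void Audio::LoadMusic(char *filename)
{
    if (m_pSound)                // assumes m_pSound starts out as 0/NULL
    {
        m_pSound->release();     // frees the old stream and its internal buffers
        m_pSound = 0;
    }
    m_Result = m_pSystem->createStream(filename, FMOD_CREATESTREAM, 0, &m_pSound);
    ErrorCheck(m_Result);
}

// Optional: log FMOD's own allocation totals to see whether FMOD is the source of growth.
void LogFmodMemory()
{
    int current = 0, maximum = 0;
    FMOD::Memory_GetStats(&current, &maximum);
    printf("FMOD memory: current=%d bytes, peak=%d bytes\n", current, maximum);
}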
I am currently working on a real-time interface in Visual Studio C++.
The problem I face is that while data is being stored in the buffer, the .exe stops responding at the point where the data is written. I collect data at 130 Hz from a motion sensor. I have tried increasing the computer's virtual memory, but that did not solve the problem.
Code Structure:
int main(){
    int no_data = 0;
    float x_abs;
    float y_abs;
    int sensorID = 0;

    while (1){
        // Define buffer
        char before_trial_output_data[][8 * 4][128] = { { { 0, }, }, };

        // Collect real-time data
        x_abs = abs(inchtocm * record[sensorID].y);
        y_abs = abs(inchtocm * record[sensorID].x);

        // Save in buffer
        sprintf(before_trial_output_data[no_data][sensorID], "%d %8.3f %8.3f\n", no_data, x_abs, y_abs);

        // Increment point
        no_data++;

        // Break the while loop when the Esc key is pressed
        if (GetAsyncKeyState(VK_ESCAPE)){
            break;
        }
    }

    // Save data to file
    printf("\nSaving results to 'RecordData.txt'..\n");
    FILE *fp3 = fopen("RecordData.dat", "w");
    for (int i = 0; i < no_data - 1; i++)
        fprintf(fp3, output_data[i][sensorID]);
    fclose(fp3);
    printf("Complete...\n");
}
The code you posted doesn't show how you allocate more memory for your before_trial_output_data buffer when needed, so I have to guess: presumably you are using some flavor of realloc(), which has to allocate an ever-increasing amount of memory and fragments your heap terribly.
However, in order to save that data to a file later on, it doesn't need to be in contiguous memory, so some kind of list will work much better than an array (a sketch follows below).
Also, there is no provision in your "pseudo" code for reading at 130 Hz; it processes records as fast as possible, and my guess is much faster than that.
Is your printf() call also "pseudo code"? Otherwise you are asking for trouble with the mismatch between the % format specifications and the number and type of parameters passed in.
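Here is a minimal sketch of that idea. The names (Sample) and the one-minute capture length are illustrative assumptions, not taken from the original code; the point is the non-contiguous container, a loop paced at roughly 130 Hz, and a format string that matches its arguments:

#include <chrono>
#include <cstdio>
#include <deque>
#include <thread>

struct Sample { int index; float x; float y; };      // hypothetical record type

int main() {
    std::deque<Sample> samples;                       // grows in chunks, no huge contiguous reallocations
    const auto period = std::chrono::microseconds(1000000 / 130);   // ~130 Hz
    auto next = std::chrono::steady_clock::now();

    for (int i = 0; i < 130 * 60; ++i) {              // e.g. one minute of data
        float x = 0.0f, y = 0.0f;                     // replace with the real sensor read
        samples.push_back({i, x, y});

        next += period;
        std::this_thread::sleep_until(next);          // pace the loop instead of spinning flat out
    }

    FILE *fp = fopen("RecordData.dat", "w");
    if (fp) {
        for (const Sample &s : samples)
            fprintf(fp, "%d %8.3f %8.3f\n", s.index, s.x, s.y);   // specifiers match the arguments
        fclose(fp);
    }
    return 0;
}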
So I used this example of the HeapWalk function to implement it in my app. I played around with it a bit and saw that when I added
HANDLE d = HeapAlloc(hHeap, 0, sizeof(int));
int* f = new(d) int;
after creating the heap, some new output would be logged:
Allocated block Data portion begins at: 0X037307E0
Size: 4 bytes
Overhead: 28 bytes
Region index: 0
Seeing this, I thought I could check entry.wFlags for PROCESS_HEAP_ENTRY_BUSY to keep track of how much allocated memory I'm using on the heap. So I have:
HeapLock(heap);

int totalUsedSpace = 0, totalSize = 0, largestFreeSpace = 0, largestCounter = 0;

PROCESS_HEAP_ENTRY entry;
entry.lpData = NULL;
while (HeapWalk(heap, &entry) != FALSE)
{
    int entrySize = entry.cbData + entry.cbOverhead;

    if ((entry.wFlags & PROCESS_HEAP_ENTRY_BUSY) != 0)
    {
        // We have allocated memory in this block
        totalUsedSpace += entrySize;
        largestCounter = 0;
    }
    else
    {
        // We do not have allocated memory in this block
        largestCounter += entrySize;
        if (largestCounter > largestFreeSpace)
        {
            // Save this value as we've found a bigger space
            largestFreeSpace = largestCounter;
        }
    }

    // Keep a track of the total size of this heap
    totalSize += entrySize;
}

HeapUnlock(heap);
And this appears to work when built in debug mode (totalSize and totalUsedSpace are different values). However, when I run it in Release mode totalUsedSpace is always 0.
I stepped through it with the debugger in Release mode; for each heap it loops three times, and I get the following flags in entry.wFlags from calling HeapWalk:
1 (PROCESS_HEAP_REGION)
0
2 (PROCESS_HEAP_UNCOMMITTED_RANGE)
It then exits the while loop and GetLastError() returns ERROR_NO_MORE_ITEMS as expected.
From there I found that a flag value of 0 means "the committed block which is free, i.e. not being allocated or not being used as control structure."
Does anyone know why it does not work as intended when built in Release mode? I don't have much experience of how memory is handled by the computer, so I'm not sure where the error might be coming from. Searching on Google didn't come up with anything so hopefully someone here knows.
UPDATE: I'm still looking into this myself. If I monitor the app using VMMap I can see that the process has 9 heaps, but GetProcessHeaps reports 22 heaps. Also, none of the heap handles it returns matches the return value of GetProcessHeap() or _get_heap_handle(). It seems like GetProcessHeaps is not behaving as expected. Here is the code to get the list of heaps:
// Count how many heaps there are and allocate enough space for them
DWORD numHeaps = GetProcessHeaps(0, NULL);
HANDLE* handles = new HANDLE[numHeaps];
// Get a handle to known heaps for us to compare against
HANDLE defaultHeap = GetProcessHeap();
HANDLE crtHeap = (HANDLE)_get_heap_handle();
// Get a list of handles to all the heaps
DWORD retVal = GetProcessHeaps(numHeaps, handles);
And retVal is the same value as numHeaps, which indicates that there was no error.
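For reference, a minimal sketch of that handle comparison, building on the variables above:

// Sketch: check whether any handle returned by GetProcessHeaps matches the
// default process heap or the CRT heap obtained above.
bool foundDefault = false, foundCrt = false;
for (DWORD i = 0; i < retVal; ++i)
{
    if (handles[i] == defaultHeap) foundDefault = true;
    if (handles[i] == crtHeap)     foundCrt = true;
}
printf("default heap found: %d, CRT heap found: %d\n", foundDefault, foundCrt);
delete[] handles;   // free the handle array allocated earlier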
Application Verifier had previously been set up to do full page-heap verification of my executable, and that was interfering with the heaps returned by GetProcessHeaps. I'd forgotten it was still enabled, as it had been set up for a different issue several days earlier and then closed without clearing the tests. It wasn't happening in the debug build because the application builds to a different file name for debug builds.
We managed to detect this by adding a breakpoint and looking at the call stack of the thread. We could see the Application Verifier DLL had been injected, and that told us where to look.
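As a quick way to spot this in the future, here is a small sketch that warns at startup if the verifier is active; it assumes verifier.dll is the module Application Verifier injects (that is my understanding, so treat the module name as an assumption):

#include <windows.h>
#include <cstdio>

// Sketch: warn if Application Verifier's injected module appears to be loaded.
// Assumes "verifier.dll" is the module name used by Application Verifier.
void CheckForAppVerifier()
{
    if (GetModuleHandleA("verifier.dll") != NULL)
        printf("Application Verifier appears to be active for this executable.\n");
}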
While profiling my CUDA application with NVIDIA Visual Profiler I noticed that any operation after cudaStreamSynchronize blocks until all streams are finished. This is very odd behavior because if cudaStreamSynchronize returns that means that the stream is finished, right? Here is my pseudo-code:
std::list<std::thread> waitingThreads;

void startKernelsAsync() {
    for (int i = 0; i < 200; ++i) {
        cudaHostAlloc(cpuPinnedMemory, size, cudaHostAllocDefault);
        memcpy(cpuPinnedMemory, data, size);
        cudaMalloc(gpuMemory);
        cudaStreamCreate(&stream);
        cudaMemcpyAsync(gpuMemory, cpuPinnedMemory, size, cudaMemcpyHostToDevice, stream);
        runKernel<<<32, 32, 0, stream>>>(gpuMemory);
        cudaMemcpyAsync(cpuPinnedMemory, gpuMemory, size, cudaMemcpyDeviceToHost, stream);
        waitingThreads.push_back(std::thread(waitForFinish, cpuPinnedMemory, stream));
    }

    while (waitingThreads.size() > 0) {
        waitingThreads.front().join();
        waitingThreads.pop_front();
    }
}

void waitForFinish(void* cpuPinnedMemory, cudaStream_t stream, ...) {
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream); // <== This blocks until all streams are finished.
    memcpy(data, cpuPinnedMemory, size);
    cudaFreeHost(cpuPinnedMemory);
    cudaFree(gpuMemory);
}
If I put cudaFreeHost before cudaStreamDestroy then it becomes the blocking operation.
Is there anything conceptually wrong here?
EDIT: I found another weird behavior: sometimes it unblocks in the middle of processing the streams and then processes the rest of them.
Normal behavior (profiler timeline screenshot):
Strange behavior, which happens quite often (profiler timeline screenshot):
EDIT2: I am testing on Tesla K40c card with compute capability 3.5 on CUDA 6.0.
As suggested in the comments, it may be viable to reduce the number of streams; however, in my application the memory transfers are quite fast and I want to use streams mainly to dynamically schedule work to the GPU. The problem is that after a stream finishes I need to download the data from pinned memory and free the allocated memory for further streams, and this seems to be a blocking operation.
I am using one stream per data set because every data set has a different size and processing takes an unpredictably long time.
Any ideas how to solve this?
I haven't found out why the operations block, but I concluded that I cannot do anything about it, so I decided to implement memory and stream pooling (as suggested in the comments) to reuse GPU memory, pinned CPU memory and streams and avoid any kind of deletion.
In case anybody is interested, here is my solution. Starting a kernel behaves as an asynchronous operation that schedules the kernel, and a callback is called after the kernel is finished.
std::vector<Instance*> m_idleInstances;
std::vector<Instance*> m_workingInstances;

void startKernelAsync(...) {
    // Wait until a finished (idle) instance is available.
    while (m_idleInstances.size() == 0) {
        findFinishedInstance();
        if (m_idleInstances.size() == 0) {
            std::chrono::milliseconds dur(10);
            std::this_thread::sleep_for(dur);
        }
    }

    Instance* instance = m_idleInstances.back();
    m_idleInstances.pop_back();

    // Fill CPU pinned memory, then queue the work on this instance's stream.
    cudaMemcpyAsync(..., instance->stream);
    runKernel<<<32, 32, 0, instance->stream>>>(instance->gpuMemory);
    cudaMemcpyAsync(..., instance->stream);

    m_workingInstances.push_back(instance);
}

void findFinishedInstance() {
    for (auto it = m_workingInstances.begin(); it != m_workingInstances.end();) {
        Instance* inst = *it;
        cudaError_t status = cudaStreamQuery(inst->stream);
        if (status == cudaSuccess) {
            it = m_workingInstances.erase(it);
            m_callback(inst->clusterGroup);
            m_idleInstances.push_back(inst);
        }
        else {
            ++it;
        }
    }
}
And at the end, just wait for everything to finish:
virtual void waitForFinish() {
    while (m_workingInstances.size() > 0) {
        Instance* instance = m_workingInstances.back();
        m_workingInstances.pop_back();
        m_idleInstances.push_back(instance);
        cudaStreamSynchronize(instance->stream);
        finalizeInstance(instance);
    }
}
And here is a graph from the profiler; it works like a charm!
Check out the list of "Implicit Synchronization" rules in the CUDA C Programming Guide PDF that comes with the toolkit. (Section 3.2.5.5.4 in my copy, but yours may differ.)
If your GPU is of "compute capability 3.0 or lower", there are some special rules that apply. My guess would be that cudaStreamDestroy() is hitting one of those limitations.
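To confirm which case applies on a given machine, a small sketch using the standard runtime call cudaGetDeviceProperties can print the compute capability of each device:

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: print the compute capability of every visible CUDA device.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}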
I'm currently re-creating a memory modifier application in C++; the original was in C#.
All credit goes to "gimmeamilk", whose tutorials I've been following on YouTube (video 1 of 8). I would highly recommend these tutorials to anyone attempting to create a similar application.
The problem I have is that my VirtualQueryEx loop seems to run forever. The process I'm scanning is "notepad.exe", whose PID I pass to the application via a command-line parameter.
std::cout << "Create scan started\n";

// Flags that mark a memory block as writable.
#define WRITABLE (PAGE_READWRITE | PAGE_WRITECOPY | PAGE_EXECUTE_READWRITE | PAGE_EXECUTE_WRITECOPY)

MEMBLOCK *mb_list = NULL;           // pointer to the head of the linked list to be returned
MEMORY_BASIC_INFORMATION meminfo;   // holder for the VirtualQueryEx return struct
unsigned char *addr = 0;            // holds the address to pass to VirtualQueryEx

HANDLE hProc = OpenProcess(PROCESS_ALL_ACCESS, false, pid);
if (hProc)
{
    while (1)
    {
        if (VirtualQueryEx(hProc, addr, &meminfo, sizeof(meminfo)) == 0)
        {
            break;
        }

        // Filter out memory the process has reserved but not committed,
        // and keep only regions with a writable protection.
        if ((meminfo.State & MEM_COMMIT) && (meminfo.Protect & WRITABLE))
        {
            MEMBLOCK *mb = create_memblock(hProc, &meminfo);
            if (mb)
            {
                mb->next = mb_list;
                mb_list = mb;
            }
        }

        // Move the address along by adding the length of the current block.
        addr = (unsigned char *)meminfo.BaseAddress + meminfo.RegionSize;
    }
}
else
{
    std::cout << "Failed to open process\n";
}

std::cout << "Create scan finished\n";
return mb_list;
The output from this code results in
Create scan started on process:7228
After that, nothing else is printed to the console. Unfortunately, the example source code linked from the YouTube video is no longer available.
(7228 will change based on the current pid of notepad.exe)
edit - reply to @Hans Passant's question:
I still don't understand. What I think I'm doing is:
Starting an infinite loop
{
    Testing with VirtualQueryEx whether the address is valid and populating my MEMORY_BASIC_INFORMATION struct
    {
        (has the process committed to using that address of memory?) (is the memory writable?)
        {
            create memblock etc.
        }
    }
    move the address along by the size of the current block
}
My program is 32-bit and so is Notepad (as far as I'm aware).
Is my problem that, because I'm running a 64-bit OS, I'm actually inspecting only half of a block (a block here meaning the unit of memory assigned by the OS), and that is causing it to loop?
Big thanks for your help! I want to understand my problem as well as fix it.
Your problem is that you're compiling a 32-bit program and using it to parse the memory of a 64-bit program. You define addr as an unsigned char pointer, which in this case is 32 bits in size. It cannot hold a 64-bit address, which is the cause of your problem.
If your target process is 64-bit, compile your program as 64-bit as well. For 32-bit target processes, compile for 32-bit. This is typically the best technique for dealing with the memory of external processes and is the fastest solution.
Depending on what you're doing, you can also use #ifdef and other conditionals to use 64-bit variables depending on the target, but the original approach is usually easier.
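If it helps, here is a small sketch using the standard Win32 call IsWow64Process to check whether the target is a 32-bit process running under WOW64 on a 64-bit OS, so you know which build of the scanner to use (the helper name is mine):

#include <windows.h>

// Sketch: returns true if the target process is a 32-bit (WOW64) process on 64-bit Windows.
bool IsTarget32BitOnWin64(DWORD pid)
{
    HANDLE hProc = OpenProcess(PROCESS_QUERY_INFORMATION, FALSE, pid);
    if (!hProc)
        return false;                      // could not open the process; treat as unknown

    BOOL isWow64 = FALSE;
    IsWow64Process(hProc, &isWow64);       // TRUE => 32-bit process on a 64-bit OS
    CloseHandle(hProc);
    return isWow64 != FALSE;
}

On a 64-bit OS, if this returns false the target is 64-bit and the scanner also needs to be a 64-bit build.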
Please look at this code below.
#include <windows.h>

void Write(char *pBuffer)
{
    // pBuffer -= 4*sizeof(int);
    for (int i = 0; i < 20; i++)
        *(pBuffer + sizeof(int)*i) = i+1;
}

void main()
{
    HANDLE hFile = ::CreateFile("file", GENERIC_READ|GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (INVALID_HANDLE_VALUE == hFile)
    {
        ::MessageBox(NULL, "", "Error", 0);
        return;
    }

    HANDLE hMMF = ::CreateFileMapping(hFile, NULL, PAGE_READWRITE, 0, 32, NULL);
    char *pBuffer = (char*)::MapViewOfFile(hMMF, FILE_MAP_WRITE, 0, 0, 0);

    Write(pBuffer);

    ::FlushViewOfFile(pBuffer, 100);
    ::UnmapViewOfFile(pBuffer);
}
I have allocated only 32 bytes, yet when I attempt to write past the allocated size I don't get any error at all. Is this by design, or is it a bug in Windows? However, if you uncomment the commented line, it gives an error as expected.
I ask because I am thinking of using this "feature" to my advantage. Can I? FYI, I have Windows XP (version 2002, SP3), but I suspect this may have been "fixed" in newer versions of Windows, which might break my code; I don't know. Any useful link explaining the internals of this would really help.
Thanks
This isn't any different than writing past the end of a buffer allocated on the heap. The operating system can only slap your fingers if you write to virtual memory that isn't mapped. Mapping is page-based, and one page is 4096 bytes. You'll have to write past this page to get the kaboom. Change your for-loop to end at (4096+4)/4 to repro it.
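A sketch of that repro, using the mapping from the code above and reading the page size from GetSystemInfo rather than hard-coding 4096:

#include <windows.h>

// Sketch: write one byte per int-stride until just past the first page of the view.
// The write at offset dwPageSize (the start of the next, unmapped page) is expected
// to raise an access violation.
void WritePastPage(char *pBuffer)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);                     // si.dwPageSize is typically 4096
    const int count = (int)((si.dwPageSize + sizeof(int)) / sizeof(int));   // e.g. (4096+4)/4
    for (int i = 0; i < count; i++)
        *(pBuffer + sizeof(int)*i) = (char)(i + 1);
}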
The virtual memory manager has to map memory by the page, so the extent will in effect be rounded up to the nearest 4kB (or whatever your system page size is).
I don't think it's documented whether writes into the same page as the mapped data, but beyond the end of the mapping, will be committed back to the file. So don't rely on that behavior; it could easily change between Windows versions.