I am using the cusparseDgtsv_nopivot function to solve a tridiagonal system of equations. The output is correct, but the function does not make proper use of CUDA multi-streaming.
The nvvp profiler shows that although every call to this solver is in a different stream, the calls never overlap.
I suspected implicit synchronization and found out through nvvp that the library function makes a lot of calls to cudaFree in between.
Is there a way to avoid this implicit synchronization?
Pseudocode of my use of cuSPARSE:
cudaStream_t stream[Nsystem];
for (int i = 0; i < Nsystem; i++) cudaStreamCreate(&stream[i]);

cusparseHandle_t handle;
cusparseCreate(&handle);
for (int i = 0; i < Nsystem; i++) {
    cusparseSetStream(handle, stream[i]);
    cusparseDgtsv_nopivot(handle, /* arguments for linear system i */);
}
cusparseDestroy(handle);
PS: a similar cudaFree issue was raised and solved when dealing with matrices: here.
The really short answer is no. There is presently no way to modify the synchronisation behaviour of cudaFree within the runtime API.
So if, as you hypothesize, the cause of the problem is internal use of cudaMalloc and cudaFree within cuSPARSE, then the only thing to do is report your use case to NVIDIA and see whether they can either propose a workaround, or provide an "expert" version of the routine where the caller manages the scratch space explicitly.
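More recent CUDA toolkits do expose that kind of "expert" interface: cusparseDgtsv2_nopivot takes a caller-allocated workspace whose size is queried up front. A rough sketch of the pattern, assuming a toolkit that provides these routines and the same m, n, dl, d, du, B, ldb arguments as the original call:

size_t bufferSize = 0;
cusparseDgtsv2_nopivot_bufferSizeExt(handle, m, n, dl, d, du, B, ldb, &bufferSize);

void* pBuffer = NULL;
cudaMalloc(&pBuffer, bufferSize);   // allocate the scratch space yourself, once per stream

cusparseSetStream(handle, stream[i]);
cusparseDgtsv2_nopivot(handle, m, n, dl, d, du, B, ldb, pBuffer);   // no internal cudaMalloc/cudaFree

// ... reuse pBuffer for later solves on the same stream, then free it at the end
cudaFree(pBuffer);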
So I am trying to use parallel_for_each...
I have code where I do:
Source s; ...
parallel_for_each(begin(_allocs), end(_allocs), [&s](Allocs_t::value_type allocation) {
    // cool stuff with allocation
});
This works, and works well. However, I've seen in many posts that I should call tbb::task_scheduler_init before scheduling tasks.
The problem is that I override malloc and calloc, and I can't have the init call malloc and calloc (which it does).
So the questions are:
Why does it work well? DOES it work well?
Is there a way to give Intel TBB a specific allocator for all its purposes?
Thanks
Instantiation of a tbb::task_scheduler_init object is optional. TBB has a lazy auto-initialization mechanism which constructs everything on the first call to TBB algorithms/scheduler. Auto-initialization is equivalent to constructing a global task_scheduler_init object just before your first call into TBB.
Of course, if you need to override the default number of threads, specify the scope where TBB should be initialized, or specify the size of the stack for workers, explicit initialization is unavoidable.
The TBB scheduler uses either its own scalable allocator (tbbmalloc.dll, libtbbmalloc.so, ...) if it is found next to the TBB binaries, or falls back to malloc otherwise. There is no way to explicitly specify any other allocator to be used by the scheduler (unlike TBB containers, which have a corresponding template argument).
Given all the above, I think you have two [combinable] options:
Ensure that the TBB scheduler uses its own allocator, so you don't have to worry about correctly replacing malloc for TBB.
Or/and, ensure that the only task_scheduler_init object is created and destroyed at a point (scope) where malloc/free are in a consistent state (see the sketch below).
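A minimal sketch of the second option, assuming Source, Allocs_t, and _allocs are the types/variables from the question:

#include <tbb/task_scheduler_init.h>
#include <tbb/parallel_for_each.h>

void process_allocations(Allocs_t& _allocs, Source& s) {
    // Construct the scheduler explicitly in a scope where the replaced
    // malloc/free are already in a consistent state; it is torn down at scope exit.
    tbb::task_scheduler_init init;   // e.g. tbb::task_scheduler_init init(4); to fix the thread count

    tbb::parallel_for_each(_allocs.begin(), _allocs.end(),
        [&s](Allocs_t::value_type allocation) {
            // cool stuff with allocation
        });
}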
Using the following code on VS 2012, native C++ development:
// requires <windows.h> and <psapi.h>
SIZE_T CppUnitTests_MemoryValidation::TakeMemoryUsageSnapshot() {
    PROCESS_MEMORY_COUNTERS_EX processMemoryCounter;
    GetProcessMemoryInfo(GetCurrentProcess(),
                         (PROCESS_MEMORY_COUNTERS*)&processMemoryCounter,
                         sizeof(processMemoryCounter));
    return processMemoryCounter.PrivateUsage;
}
I call this method before and after each CPPUnitTest and calculate the difference of the PrivateUsage field. Normally this difference should be zero, assuming my memory allocation doesn't leak.
Only simple things happen inside my test class. Even without any memory allocation (just creating an instance of my test class and releasing it again), the difference sometimes ends up above zero, though not in every test iteration, so this scheme seems to be non-deterministic.
Is there somebody with more insight than me who could either explain how to tackle this or tell me what is wrong with my assumptions?
In short, your assumptions are not correct. There can be a lot of other things going on in your process that perform memory allocation (the Event Tracing thread, and any others created by third-party add-ons on your system) so it is not surprising to see memory use go up occasionally.
Following Hans Passant's debug allocator link, I noticed some more information about Microsoft's memory leak detection instrumentation, in particular the _CrtMemCheckpoint function(s).
The link I followed was http://msdn.microsoft.com/en-us/library/5tz9b54s(v=vs.90).aspx
Now, when taking my memory snapshots with this function and checking for a difference using the _CrtMemDifference function, this seems to work reliably and deterministically.
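For reference, the call pattern looks roughly like this sketch (debug builds only; RunSingleTest is a hypothetical stand-in for one test iteration):

#include <crtdbg.h>

_CrtMemState before, after, diff;

_CrtMemCheckpoint(&before);        // snapshot the CRT debug heap before the test
RunSingleTest();                   // hypothetical placeholder for one CPPUnit test
_CrtMemCheckpoint(&after);         // snapshot again afterwards

if (_CrtMemDifference(&diff, &before, &after)) {
    _CrtMemDumpStatistics(&diff);  // non-zero difference: report what changed
}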
Can I allocate a block on the heap, set its bytes to values that correspond to a function call and its parameters, then use the function call and dereference operators to execute that sequence?
So, if I read you right, you want to dynamically create CPU instructions on the heap and execute them, a bit like self-modifying code. In theory that's possible, but in practice maybe not.
The problem is that the heap is in a data segment, and CPUs/operating systems nowadays have measures to prevent exactly this kind of behavior (the NX bit, or No-eXecute bit, on x86 CPUs). If a segment is marked NX, you can't execute code from it. This was invented to stop computer viruses from using buffer overflows to place executable code in data/heap/stack memory and then trick the calling program into executing that code.
Note that DLLs and libraries are loaded into the code segment, which of course allows code execution.
Yes. How else could dynamic loading and linking work? Remember that some (most?) operating systems and some (most?) linkers are also written in C/C++. For example,
#include <dlfcn.h>

void* sdl_library = dlopen("libSDL.so", RTLD_LAZY);   // load the shared library first
if (sdl_library == NULL) {
    // report error ...
} else {
    void* initializer = dlsym(sdl_library, "SDL_Init");
    if (initializer == NULL) {
        // report error ...
    } else {
        // cast initializer to its proper type and use
    }
}
Also, I believe that a JIT (e.g. GNU lightning and others) in general performs those operations.
In Windows, for example, this is now very hard to do when it was once very easy. I used to be able to take an array of bytes in C and then cast it to a function pointer type to execute it... but not any more.
Now, you can do this if you call GlobalAlloc or VirtualAlloc and specifically ask for executable memory. On most platforms it's either completely open or massively locked down. Doing this sort of thing on iOS, for example, is a massive headache and will cause a submission failure on the App Store if discovered.
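A rough sketch of the VirtualAlloc route on Windows, assuming machine_code and code_size (both placeholders here) already hold valid instructions for the current architecture:

#include <windows.h>
#include <string.h>

typedef int (*func_t)(void);

// Ask the OS for memory that is explicitly allowed to be executed.
void* exec_mem = VirtualAlloc(NULL, code_size,
                              MEM_COMMIT | MEM_RESERVE,
                              PAGE_EXECUTE_READWRITE);
memcpy(exec_mem, machine_code, code_size);
FlushInstructionCache(GetCurrentProcess(), exec_mem, code_size);

func_t fn = (func_t)exec_mem;       // reinterpret the bytes as a function
int result = fn();                  // jump into the freshly written code

VirtualFree(exec_mem, 0, MEM_RELEASE);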
Here is some fantastically out-of-date and crusty code where I did the original thing you described:
https://code.google.com/p/fridgescript/source/browse/trunk/src/w32/Code/Platform_FSCompiledCode.cpp
using bytes from https://code.google.com/p/fsassembler
You may notice in there that I need to provide platform-specific (Windows) allocation functions to get some executable memory:
https://code.google.com/p/fridgescript/source/browse/trunk/src/w32/Core/Platform_FSExecutableAlloc.cpp
Yes, but you must ensure that the memory is marked executable. How you do that depends on the architecture.
I have the following code:
for i=1:N,
some_mex_file();
end
My MEX file does the following:
Declares an object, of a class I defined, that has 2 large memory blocks, i.e., 32x2048x2 of type double.
Processes the data in this object.
Destroys the object.
I am wondering if it takes more time when I call a MEX file in a loop that allocates large memory blocks for its object. I was thinking of migrating to C++ so that I can declare the object only once and just reset its memory space so that it can be used again and again without new declaration. Is this going to make a difference or going to be a worthless effort? In other words, does it take more time to allocate a memory in MEX file than to declare it once and reuse it?
So, the usual advice here applies: Profile your code (both in Matlab and using a C/C++ profiler), or at least stop it in a debugger several times to see where it's spending its time. Stop "wondering" about where it's spending its time, and actually measure where it's spending its time.
However, I have run into problems like this, where allocating/deallocating memory in the MEX function is the major performance sink. You should verify this, however, by profiling (or stopping the code in a debugger).
The easiest solution to this kind of performance problem is twofold:
Move the loop into the MEX function. Call the MEX function with an iteration count, and let your fast C/C++ code actually perform the loop. This eliminates the cost of calling from Matlab into your MEX function (which can be substantial for large N), and facilitates the second optimization:
Have your MEX function cache its allocations/deallocations, which is much, much easier (and safer) to do if you move the loop into the MEX function. This can be done several ways, but the easiest is to just allocate the space once (outside the loop) and deallocate it once the loop is done (see the sketch below).
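A rough sketch of both suggestions combined, assuming the iteration count N is passed in as the first argument and process_block is a hypothetical stand-in for the per-iteration work:

#include "mex.h"

// Hypothetical stand-in for the real per-iteration processing.
static void process_block(double* block) { /* ... */ }

void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]) {
    const int N = (int)mxGetScalar(prhs[0]);   // iteration count passed from Matlab

    // Allocate the large working buffer once, outside the loop.
    double* block = (double*)mxMalloc(32 * 2048 * 2 * sizeof(double));

    for (int i = 0; i < N; ++i) {
        process_block(block);   // reuse the same buffer on every iteration
    }

    mxFree(block);   // deallocate once, after the loop
}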
Background: For a C++ AMP overview, see Daniel Moth's recent BUILD talk.
Going through the initial walk-throughs here, here, here, and here.
Only in that last reference do they make a call to array_view.synchronize().
In these simple examples, is a call to synchronize() not needed? When is it safe to exclude? Can we trust parallel_for_each to behave "synchronously" without it (with respect to the preceding code)?
Use synchronize() when you want to access the data without going through the array_view interface. If all of your access to the data uses array_view operators and functions, you don't need to use synchronize(). As Daniel mentioned, the destructor of an array_view forces a synchronize as well, and it's better to call synchronize() in that case so you can get any exceptions that might be thrown.
The synchronize function forces an update to the buffer within the calling context -- that is if you write data on the GPU and then call synchronize in CPU code, at that point the updated values are copied to CPU memory.
This seems obvious from the name, but I mention it because other array_view operations can cause a 'synchronize' as well. C++ AMP's array_view tries its best to make copying between CPU and GPU memory implicit: any operation which reads data through the array_view interface will cause a copy as well.
std::vector<int> v(10);
array_view<int, 1> av(10, v);
parallel_for_each(av.grid, [=](index<1> i) restrict(direct3d) {
    av[i] = 7;
});
// at this point, data isn't copied back
std::wcout << v[0]; // should print 0
// using the array_view to access data will force a copy
std::wcout << av[0]; // should print 7
// at this point data is copied back
std::wcout << v[0]; // should print 7
my_array_view_instance.synchronize() is not required for the simple examples I showed, because the destructor calls synchronize. Having said that, I am not following best practice (sorry), which is to call synchronize explicitly. The reason is that if any exceptions are thrown at that point, you would not observe them if you left it up to the destructor, so please call synchronize explicitly.
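A minimal sketch of that explicit call, using the av array_view from the snippet above (the exact exception type may vary; concurrency::runtime_exception is assumed here):

try {
    av.synchronize();   // force the copy back to CPU memory now
} catch (const concurrency::runtime_exception& ex) {
    std::wcout << ex.what();   // this would have been swallowed by the destructor
}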
Cheers
Daniel
Just noticed the second question in your post about parallel_for_each being synchronous vs asynchronous (sorry, I am used to one question per thread ;-)
"Can we trust parallel_for_each to behave "synchronously" without it (w/r/t the proceeding code)?"
The answer to that is on my post about parallel_for_each:
http://www.danielmoth.com/Blog/parallelforeach-From-Amph-Part-1.aspx
...and also in the BUILD recording you pointed to, from 29:20 to 33:00:
http://channel9.msdn.com/Events/BUILD/BUILD2011/TOOL-802T
In a nutshell, no, you cannot trust it to be synchronous; it is asynchronous. The (implicit or explicit) synchronization point is any code that tries to access the data that is expected to be copied back from the GPU as a result of your parallel loop.
Cheers
Daniel
I'm betting that it's never safe to exclude, because in a multi-threaded (concurrent or parallel) setting it's never safe to assume anything. Certain constructs give you certain guarantees, but you have to be super careful and meticulous not to break those guarantees by introducing something you think is fine to do when, in reality, there's a lot of complexity underpinning the whole thing.
Haven't spent any time with C++ AMP yet, but I'm inclined to try it out.