CUDA for pytorch: CUDA C++ stream and state

I am trying to follow this tutorial and make a simple C++ extension with a CUDA backend.
My CPU implementation seems to work fine.
I am having trouble finding examples and documentation (it seems like things are constantly changing).
Specifically,
I see pytorch cuda functions getting a THCState *state argument - where does this argument come from? How can I get a state for my function as well?
For instance, in cuda implementation of tensor.cat:
void THCTensor_(cat)(THCState *state, THCTensor *result, THCTensor *ta, THCTensor *tb, int dimension)
However, when calling tensor.cat() from python one does not provide any state argument; pytorch provides it "behind the scenes". How does pytorch provide this information, and how can I get it?
state is then converted to cudaStream_t stream = THCState_getCurrentStream(state);
For some reason, THCState_getCurrentStream is no longer defined. How can I get the stream from my state?
I also tried asking on the pytorch forum - so far to no avail.

It's deprecated (without documentation!)
See here:
https://github.com/pytorch/pytorch/pull/14500
In short: use at::cuda::getCurrentCUDAStream()
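In current extensions the THCState plumbing disappears entirely; a minimal sketch of a CUDA extension op, assuming the placeholder names my_op and my_kernel (not from the tutorial):

#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>

torch::Tensor my_op(torch::Tensor input) {
    // No state argument needed: ATen tracks the current stream per device and
    // thread, which is what pytorch was passing "behind the scenes" via THCState.
    cudaStream_t stream = at::cuda::getCurrentCUDAStream();
    auto output = torch::empty_like(input);
    // my_kernel<<<grid, block, 0, stream>>>(...);  // launch on the current stream
    return output;
}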

Related

Can I read a SMT2 file into a solver through the z3 c++ interface?

I've got a problem where the z3 code embedded in a larger system isn't finding a solution to a certain set of constraints (added through the C++ interface) despite some fairly long timeouts. When I dump the constraints to a file (using the to_smt2() method on the solver, just before the call to check()), and run the file through the standalone z3 executable, it solves the system in about 4 seconds (returning sat). For what it's worth, the file is 476,587 lines long, so a fairly big set of constraints.
Is there a way I can read that file back into the embedded solver using the C++ interface, replacing the existing constraints, to see if the embedded version can solve starting from the exact same starting point as the standalone solver? (Essentially, how could I create a corresponding from_smt2(stream) method on the solver class?)
They should be the same set of constraints as now, of course, but maybe there's some ordering effect going on when they are read from the file, or maybe there are some subtle differences in the solver introduced when we embedded it, or something that didn't get written out with to_smt2(). So I'd like to try reading the file back, if I can, to narrow down the possible sources of the difference. Suggestions on what to look for while debugging the long-running version would also be helpful.
Further note: it looks like another user is having similar issues here. Unlike that user, my problem uses all bit-vectors, and the only unknown result is the one from the embedded code. Is there a way to invoke the (get-info :reason-unknown) from the C++ interface, as suggested there, to find out why the embedded version is having a problem?
You can use the method "solver::reason_unknown()" to retrieve explanations for search failure.
There are methods for parsing files and strings into a single expression.
In case of a set of assertions, the expression is a conjunction.
It is perhaps a good idea to add such a method directly to the solver class for convenience. It would be:
void from_smt2_string(char const* smt2benchmark) {
    expr fml = ctx().parse_string(smt2benchmark);
    add(fml);
}
So if you were to write it outside of the solver class, you would need to:
expr fml = solver.ctx().parse_string(smt2benchmark);
solver.add(fml);
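A minimal sketch of the file-reading variant, assuming a recent Z3 where context::parse_file returns an expr_vector (older releases returned a single conjunction expr, as above); load_smt2_file is a hypothetical helper name:

#include <iostream>
#include <z3++.h>

// Hypothetical helper: replace the solver's constraints with the ones in an SMT2 file.
void load_smt2_file(z3::solver& s, char const* path) {
    s.reset();                                        // drop the existing assertions
    z3::expr_vector fmls = s.ctx().parse_file(path);  // parse the benchmark
    for (unsigned i = 0; i < fmls.size(); ++i)
        s.add(fmls[i]);
    if (s.check() == z3::unknown)
        std::cout << s.reason_unknown() << "\n";      // C++ analogue of (get-info :reason-unknown)
}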

Different ways to make kernel

In this tutorial, there are two methods to run the kernel, and another one mentioned in the comments:
1.
cl::KernelFunctor simple_add(cl::Kernel(program,"simple_add"),queue,cl::NullRange,cl::NDRange(10),cl::NullRange);
simple_add(buffer_A,buffer_B,buffer_C);
However, I found out that KernelFunctor is gone.
So I tried the alternative way:
2.
cl::Kernel kernel_add=cl::Kernel(program,"simple_add");
kernel_add.setArg(0,buffer_A);
kernel_add.setArg(1,buffer_B);
kernel_add.setArg(2,buffer_C);
queue.enqueueNDRangeKernel(kernel_add,cl::NullRange,cl::NDRange(10),cl::NullRange);
queue.finish();
It compiles and runs successfully.
However, there is a 3rd option in the comments:
3.
cl::make_kernel simple_add(cl::Kernel(program,"simple_add"));
cl::EnqueueArgs eargs(queue,cl::NullRange,cl::NDRange(10),cl::NullRange);
simple_add(eargs, buffer_A,buffer_B,buffer_C).wait();
This does not compile; I think make_kernel needs template arguments.
I'm new to OpenCL, and didn't manage to fix the code.
My question is:
1. How should I modify the 3. code to compile?
2. Which way is better and why? 2. vs. 3.?
You can check the OpenCL C++ Bindings Specification for a detailed description of the cl::make_kernel API (in section 3.6.1), which includes an example of usage.
In your case, you could write something like this to create the kernel functor:
auto simple_add = cl::make_kernel<cl::Buffer&, cl::Buffer&, cl::Buffer&>(program, "simple_add");
Your second question is primarily opinion based, and so is difficult to answer. One could argue that the kernel functor approach is simpler, as it allows you to 'call' the kernel almost as if it were just a function and pass the arguments in a familiar manner. The alternative approach (option 2 in your question) is more explicit about setting arguments and enqueuing the kernel, but more closely represents how you would write the same code using the OpenCL C API. Which method you use is entirely down to personal preference.
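For completeness, a corrected version of option 3 under the 1.2 C++ bindings, using the question's program, queue, and buffers (note that in the later cl2.hpp bindings, make_kernel was in turn replaced by a templated cl::KernelFunctor):

auto simple_add = cl::make_kernel<cl::Buffer&, cl::Buffer&, cl::Buffer&>(program, "simple_add");
cl::EnqueueArgs eargs(queue, cl::NullRange, cl::NDRange(10), cl::NullRange);
simple_add(eargs, buffer_A, buffer_B, buffer_C).wait();  // the call returns a cl::Event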

Changing an OpenCV function Standard Parameters

Is there a way to permanently change the standard parameters in an OpenCV function?
For example, how can I modify the MSER Feature Detector so that I can call
MserFeatureDetector detector
instead of
MserFeatureDetector detector(10,50,1000)
I am not precisely well versed in the inner mechanisms of C++ libraries, but I imagine the actual program code has to be somewhere, right?
A bit of information on my actual problem:
I'm currently using MEXOpenCV to run OpenCV functions in MATLAB, and some MEX functions lack (as far as I know) the option to pass input parameters, instead running with the defaults, like this:
detector = cv.FeatureDetector('MSER'); % 'MSER' is the only parameter taken
I reckon changing the standard parameters directly in the OpenCV source would be one way to do it.
Any other ideas on how to solve the actual problem are welcome too!
I solved the actual problem by setting the parameters with the 'set' method of the FeatureDetector, like this:
detector=cv.FeatureDetector('MSER'); detector.set('delta',10);
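A C++-side alternative that avoids editing OpenCV itself is a thin factory wrapper carrying your preferred defaults; a sketch against the OpenCV 2.x features2d API used in the question (makeDefaultMser is a hypothetical helper name, and the values are the ones from the question):

#include <opencv2/features2d/features2d.hpp>

// Hypothetical helper: call sites get the preferred defaults without repeating them.
cv::MserFeatureDetector makeDefaultMser()
{
    return cv::MserFeatureDetector(10, 50, 1000);  // delta, minArea, maxArea
}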

Boost threads: on iOS, thread_info object is being destructed before the thread finishes executing

Our project uses a few boost 1.48 libraries on several platforms, including Windows, Mac, Android, and iOS.
We are able to consistently get the iOS version of the project to crash (nontrivially but reliably), and
from our investigation we see that ~thread_data_base is being called on the thread's thread_info while its thread is still running.
This seems to happen as a result of the smart pointer reaching a zero count, even though it is obviously still
in scope in the thread_proxy function which creates it and runs the requested function in the thread.
This seems to happen in various cases - the call stack is not identical between crashes, though there are a few
variations which are common.
Just to be clear - this often requires running code which creates hundreds of threads, though there are
never more than about 30 running simultaneously. I have "been lucky" and gotten it very early in the
run as well, but that's rare.
I created a version of the destructor which actually catches the code red-handed:
in libs/thread/src/pthread/thread.cpp:
thread_data_base::~thread_data_base()
{
    boost::detail::thread_data_base* const thread_info = detail::get_current_thread_data();
    void *void_thread_info = (void *) thread_info;
    void *void_this = (void *) this;
    // is somebody destructing the thread_data other than its own thread?
    // (remember that its own thread should no longer point to it anyway,
    // because of the call to detail::set_current_thread_data(0) in thread_proxy)
    if (void_thread_info) { // == void_this) {
        __builtin_trap();
    }
}
I should note that (as seen from the commented-out code) I had previously checked to see that void_thread_info == void_this because I
was only checking for the case where the thread's current thread_info was killing itself.
I have also seen cases where the value returned by get_current_thread_data is non-zero and
different from "this", which is really weird.
Also when I first wrote that version of the code, I wrote:
if (((void*)thread_info) == ((void*)this))
and at run-time I got some very weird exception that said something about a virtual function table
or something like that - I don't remember. I decided that it was trying to call "==" for this object type
and was unhappy with that, so I rewrote it as above, putting the conversions to void * on separate
lines of code. That in itself is quite suspicious to me. I am not one to rush to blame compilers, but...
I should also note that when we did catch this happening in the trap, we saw the destructor
~shared_count appear twice consecutively on the stack in the Xcode source view. Doubly weird.
We tried to look at the disassembly, but couldn't make much out of it.
Again - it looks like this is always a result of the shared_count which seems to be owned by
the shared_ptr which owns the thread_info reaching zero too early.
Update: it seems that it is possible to get into situations which reach the above trap without the situation doing any harm. Since fixing the issue (see answer) I have seen it happen, but always after thread_info->run() has finished executing. Don't yet understand how...but it's working.
Some additional info:
I should note that the boost.sh from Pete Goodliffe (and modified by others) that is commonly used to compile boost for iOS
has the following note in the header:
: ${EXTRA_CPPFLAGS:="-DBOOST_AC_USE_PTHREADS -DBOOST_SP_USE_PTHREADS"}
# The EXTRA_CPPFLAGS definition works around a thread race issue in
# shared_ptr. I encountered this historically and have not verified that
# the fix is no longer required. Without using the posix thread primitives
# an invalid compare-and-swap ARM instruction (non-thread-safe) was used for the
# shared_ptr use count causing nasty and subtle bugs.
#
# Should perhaps also consider/use instead: -BOOST_SP_USE_PTHREADS
I use those flags, but to no avail.
I found the following which is very tantalizing - it looks like they had the same issue in std::thread:
http://llvm.org/bugs/show_bug.cgi?format=multiple&id=12730
That was suggestive of using an alternate implementation inside boost for arm processors which seems also to directly address this issue:
spinlock_gcc_arm.hpp
The version included with boost 1.48 uses outdated ARM assembly.
I took the updated version from boost 1.52, but I'm having trouble compiling it.
I get the following error:
predicated instructions must be in IT block
I found a reference to what looks to be a similar use of this instruction here:
https://zeromq.jira.com/browse/LIBZMQ-414
I was able to use the same idea to get the 1.52 code to compile by modifying
the code as follows (I inserted an appropriate IT instruction)
__asm__ __volatile__(
"ldrex %0, [%2]; \n"
"cmp %0, %1; \n"
"it ne; \n"
"strexne %0, %1, [%2]; \n"
BOOST_SP_ARM_BARRIER :
"=&r"( r ): // outputs
"r"( 1 ), "r"( &v_ ): // inputs
"memory", "cc" );
But in any case, there are ifdefs in this file which look for the ARM architecture, which is not defined that way in my environment. After I simply edited the file so that only ARM 7 code
was left, the compiler complained about the definition of BOOST_SP_ARM_BARRIER:
In file included from ./boost/smart_ptr/detail/spinlock.hpp:35:
./boost/smart_ptr/detail/spinlock_gcc_arm.hpp:39:13: error: instruction requires a CPU feature not currently enabled
BOOST_SP_ARM_BARRIER :
^
./boost/smart_ptr/detail/spinlock_gcc_arm.hpp:13:32: note: expanded from macro 'BOOST_SP_ARM_BARRIER'
# define BOOST_SP_ARM_BARRIER "dmb"
Any ideas??
Figured this out. It turns out that the boost.sh script that I mention in the question chose the incorrect boost flag to address this problem - instead of BOOST_SP_USE_PTHREADS (and the other flag there with it, BOOST_AC_USE_PTHREADS), what is needed on iOS is BOOST_SP_USE_SPINLOCK. This ends up giving pretty much the identical solution used in the std::thread issue referred to in the question.
If you are compiling for any modern iOS device which uses ARM 7, but using an older boost (we are using 1.48), you need to copy the file spinlock_gcc_arm.hpp from a more recent boost (like 1.52). That file is #ifdef'd for the different ARM architectures, but it is not clear to me that the defines it is looking for are defined in the iOS compile environment using the script. So you can either edit the file (violent but effective) or invest some time to figure out how to make this tidy and correct.
In any case, you may need to insert the extra assembly instruction that I did above in the question:
"it ne; \n"
I have not yet gone back to see if I can delete that now that my compile environment is working.
However, we're not done yet. The code used in boost for this option includes, as discussed, ARM assembly language instructions. The ARM chips support two instruction sets which can't be mixed in a given module (not sure of the scope, but evidently file by file is an acceptable granularity when compiling). The instructions used in boost for this locking include non-Thumb instructions, but iOS by default uses the Thumb instruction set. The boost code, aware of the instruction-set issue, checks to see that you have ARM enabled but not Thumb, but by default in iOS, Thumb is on.
Getting the compiler to generate non-Thumb ARM code depends on which compiler you are using in iOS - Apple's LLVM or LLVM GCC. GCC is deprecated, and Apple's LLVM is the default when you use Xcode.
For the default Clang + Apple LLVM 4.1, you need to compile using the -mno-thumb flag. Also, any files in your iOS app which use any part of boost which uses smart pointers will have to be compiled with -mno-thumb.
To compile boost like this, I think you can just add -mno-thumb to the EXTRA_CPPFLAGS in the script. (I modified the user-config.jam directly while experimenting and haven't yet gone back to clean up.)
For your app, in Xcode you need to select your target, then go into the Build Phases tab, and there select Compile sources. There you have the option of adding compile flags, so for each relevant file (which includes boost), add the -mno-thumb flag. You can do this directly in project.pbxproj also where each file has
settings = { COMPILER_FLAGS = ""; };
you just change this to
settings = { COMPILER_FLAGS = "-mno-thumb"; };
But there's a little more. You also have to modify the darwin.jam file in the tools/build/v2/tools directory. In boost 1.48, there is code that says:
case arm :
{
options = -arch armv6;
}
This has to be modified to
case arm :
{
options = -arch armv7 ;
}
Finally, in the boost.sh script, in the function writeBjamUserConfig(), you should remove the references to -arch armv6.
If somebody knows how to do this a little more generally and cleanly, I'm sure we'd all benefit. For now, this is where I've gotten to, and I hope that this will help other iOS boost threads users. I hope that the various variants of the boost.sh iOS script out there will be updated. I plan to add some more links to this answer later.
Update: For a great article which describes the issue on the processor level,
see here:
http://preshing.com/20121019/this-is-why-they-call-it-a-weakly-ordered-cpu
Enjoy!
I use boost.asio, boost.thread, boost.smart_ptr, etc. on the iOS platform; the app always crashes when run in release mode, throwing SIGABRT. The crash call stack is:
__stack_chk_fail
boost::asio::detail::completion_handler
boost::asio::detail::task_io_service_operation::complete
boost::asio::detail::task_io_service::do_run_one
boost::asio::detail::task_io_service::run
boost::asio::io_service::run
(Screenshot: creating an asio work object with a new thread and io_service - see [1].)
When trying to solve the problem, I found the following articles:
[boost-thread-threads-not-starting-on-the-iphone-ipad-in-release-build][2]
[The issue of spin_lock and thumb on iOS][3]
Then I tried adding -mno-thumb to my project's compile flags, and the problem that occurred in release mode was gone.
However, a new bug appeared: EXC_ARM_DA_ALIGN, which crashed where I try to convert network data to host endianness.
As [this article][4] says, ARM instructions require that memory accesses be aligned.
Following the article [Exc_arm_da_align][5], I fixed it by using memcpy for the data conversion, instead of converting directly through the pointer.
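The alignment-safe conversion amounts to copying the possibly unaligned bytes into a local variable before byte-swapping, rather than dereferencing a cast pointer; a minimal sketch (read_u32 is a hypothetical helper name):

#include <cstdint>
#include <cstring>
#include <arpa/inet.h>

uint32_t read_u32(const unsigned char* buf)
{
    uint32_t v;
    std::memcpy(&v, buf, sizeof v);  // safe even if buf is not 4-byte aligned
    return ntohl(v);                 // network byte order to host byte order
}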
[1]: http://i.stack.imgur.com/3ijF4.png
[2]: http://stackoverflow.com/questions/4201262/boost-thread-threads-not-starting-on-the-iphone-ipad-in-release-builds/4245821#4245821
[3]: http://groups.google.com/group/boost-list/browse_thread/thread/7dc1e80659182ab3
[4]: https://brewx.qualcomm.com/bws/content/gi/common/appseng/en/knowledgebase/docs/kb95.html
[5]: http://www.cnblogs.com/unionfind/archive/2013/02/25/2932262.html

How to use CUDA constant memory in a programmer pleasant way?

I'm working on a number-crunching app using the CUDA framework. I have some static data that should be accessible to all threads, so I've put it in constant memory like this:
__device__ __constant__ CaseParams deviceCaseParams;
I use the call cudaMemcpyToSymbol to transfer these params from the host to the device:
void copyMetaData(CaseParams* caseParams)
{
    cudaMemcpyToSymbol("deviceCaseParams", caseParams, sizeof(CaseParams));
}
which works.
Anyways, it seems (by trial and error, and also from reading posts on the net) that for some sick reason, the declaration of deviceCaseParams and the copy operation on it (the call to cudaMemcpyToSymbol) must be in the same file. At the moment I have these two in a .cu file, but I really want to have the parameter struct in a .cuh file so that any implementation could see it if it wants to. That means that I also have to have the copyMetaData function in a header file, but this messes up linking (symbol already defined) since both .cpp and .cu files include this header (and thus both the MS C++ compiler and nvcc compile it).
Does anyone have any advice on design here?
Update: See the comments
With an up-to-date CUDA (e.g. 3.2) you should be able to do the memcpy from within a different translation unit if you're looking up the symbol at runtime (i.e. by passing a string as the first argument to cudaMemcpyToSymbol, as you are in your example).
Also, with Fermi-class devices you can just malloc the memory (cudaMalloc), copy to the device memory, and then pass the argument as a const pointer. The compiler will recognise if you are accessing the data uniformly across the warps and if so will use the constant cache. See the CUDA Programming Guide for more info. Note: you would need to compile with -arch=sm_20.
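A sketch of that second approach, with hypothetical field names since CaseParams isn't shown in the question (process and uploadParams are placeholder names; compile with -arch=sm_20):

struct CaseParams { float scale; int n; };  // hypothetical fields

// With a const pointer and accesses that are uniform across the warp,
// an sm_20+ compiler can serve these reads through the constant cache.
__global__ void process(const CaseParams* params, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < params->n)
        out[i] = params->scale * i;
}

void uploadParams(const CaseParams* hostParams, CaseParams** devParams)
{
    cudaMalloc((void**)devParams, sizeof(CaseParams));                               // plain device memory
    cudaMemcpy(*devParams, hostParams, sizeof(CaseParams), cudaMemcpyHostToDevice);  // host-to-device copy
}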
If you're using pre-Fermi CUDA, you will have found out by now that this problem doesn't just apply to constant memory, it applies to anything you want on the CUDA side of things. The only two ways I have found around this are to either:
Write everything CUDA in a single file (.cu), or
If you need to break out code into separate files, restrict yourself to headers which your single .cu file then includes.
If you need to share code between CUDA and C/C++, or have some common code you share between projects, option 2 is the only choice. It seems very unnatural to start with, but it solves the problem. You still get to structure your code, just not in a typically C like way. The main overhead is that every time you do a build you compile everything. The plus side of this (which I think is possibly why it works this way) is that the CUDA compiler has access to all the source code in one hit which is good for optimisation.
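For the constant-memory case specifically, option 2 could be structured so that the header carries only a declaration of copyMetaData while the single .cu file holds the __constant__ definition and the copy, which avoids the "symbol already defined" linker error. A sketch with hypothetical file names and struct fields (note that the symbol itself, rather than a string, is passed here, since later CUDA releases dropped string-based symbol lookup):

// CaseParams.cuh - safe to include from both .cpp and .cu files
struct CaseParams { float scale; int n; };  // hypothetical fields
void copyMetaData(CaseParams* caseParams);  // declaration only, so no duplicate definitions

// kernels.cu - the single CUDA translation unit
// #include "CaseParams.cuh"
__device__ __constant__ CaseParams deviceCaseParams;

void copyMetaData(CaseParams* caseParams)
{
    cudaMemcpyToSymbol(deviceCaseParams, caseParams, sizeof(CaseParams));
}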