How to use CUDA constant memory in a programmer pleasant way?

How to use CUDA constant memory in a programmer pleasant way? - c++

I'm working on a number crunching app using the CUDA framework. I have some static data that should be accessible to all threads, so I've put it in constant memory like this:
__device__ __constant__ CaseParams deviceCaseParams;
I use the call cudaMemcpyToSymbol to transfer these params from the host to the device:
void copyMetaData(CaseParams* caseParams)
{
cudaMemcpyToSymbol("deviceCaseParams", caseParams, sizeof(CaseParams));
}
which works.
Anyways, it seems (by trial and error, and also from reading posts on the net) that for some sick reason, the declaration of deviceCaseParams and the copy operation of it (the call to cudaMemcpyToSymbol) must be in the same file. At the moment I have these two in a .cu file, but I really want to have the parameter struct in a .cuh file so that any implementation could see it if it wants to. That means that I also have to have the copyMetaData function in the a header file, but this messes up linking (symbol already defined) since both .cpp and .cu files include this header (and thus both the MS C++ compiler and nvcc compiles it).
Does anyone have any advice on design here?
Update: See the comments

With an up-to-date CUDA (e.g. 3.2) you should be able to do the memcpy from within a different translation unit if you're looking up the symbol at runtime (i.e. by passing a string as the first arg to cudaMemcpyToSymbol as you are in your example).
Also, with Fermi-class devices you can just malloc the memory (cudaMalloc), copy to the device memory, and then pass the argument as a const pointer. The compiler will recognise if you are accessing the data uniformly across the warps and if so will use the constant cache. See the CUDA Programming Guide for more info. Note: you would need to compile with -arch=sm_20.

If you're using pre-Fermi CUDA, you will have found out by now that this problem doesn't just apply to constant memory, it applies to anything you want on the CUDA side of things. The only two ways I have found around this are to either:
Write everything CUDA in a single file (.cu), or
If you need to break out code into separate files, restrict yourself to headers which your single .cu file then includes.
If you need to share code between CUDA and C/C++, or have some common code you share between projects, option 2 is the only choice. It seems very unnatural to start with, but it solves the problem. You still get to structure your code, just not in a typically C like way. The main overhead is that every time you do a build you compile everything. The plus side of this (which I think is possibly why it works this way) is that the CUDA compiler has access to all the source code in one hit which is good for optimisation.

Related

if we had a single file project that contained all the code can we not use the linker?

Linker question:
if I had a file. c that has no includes at all, would we still need a linker?

Although the linker is so-named because it links together multiple object files, it performs other functions as well. It may resolve addresses that were left incomplete by the compiler. It produces a program in an executable file format that the system’s program loader can read and load, and that format may differ from that of object modules. Specifics depend on the operating system and build tools.
Further, to have a complete program in one source file, you must provide not just the main routine you are familiar with from C and C++ but also the true start of the program, the entry point that the program loader starts execution at, and you must provide implementations for all functions you use in the program, such as invocations of system services via special trap or system-call instructions to read and write data.

You can create a project, which has no typical C startup code, in which case, you may not even have a main(). However, you still need a linker, because the linker creates the required executable file format for the given architecture.
It also will set the entrypoint, where the actual execution starts.
So you can omit the standard libraries, and create a binary, which is completly void of any C functions, but you still need the linker to actually make a runable binary.
The object file format, generated by the compiler, is very different to the executable file format, because it only provides all information, that is required for the linker.

Yes. The linker does more than merely link the files. Check out this resource for more info: https://en.wikibooks.org/wiki/C%2B%2B_Programming/Programming_Languages/C%2B%2B/Code/Compiler/Linker#:~:text=The%20linker%20is%20a%20program,translation%20unit%20have%20external%20linkage.
Believe it or not, multiple libraries can be referenced by default. So, even if you don't #includea resource, the compiler may have to internally link or reference something outside of the translation unit. There are also redundancies and other considerations that are "eliminated" by the compiler.

Despite its name the linker is properly a "linker/locater". It performs two functions - 1) linking object code, 2) determining where in memory the data and code elements exist.
The object code out of the compiler is not "located" even if it has no unresolved links.
Also even if you have the simplest possible valid code:
int main(){ return 0; }
with no includes, the linker will normally implicitly link the C runtime start-up, which is required to do everything necessary before running main(). That may be very little. On some target such as ARM Cortex-M you can in fact run C code directly from the reset vector so long as you don't assume static initialisation or complete library support. So it is possible to write the reset code entirely in C, but you probably still need code to initialise the vector table with the reset handler (your C start-up function) and the initial stack pointer. On Cortex-M that can be done using in-line assembler perhaps, but it is all rather cumbersome and unnecessary and does not forgo the linker.

ELF INIT section code to prepopulate objects used at runtime

I'm fairly new to c++ and am really interested in learning more. Have been reading quite a bit. Recently discovered the init/fini elf sections.
I started to wonder if & how one would use the init section to prepopulate objects that would be used at runtime. Say for example you wanted
to add performance measurements to your code, recording the time, filename, linenumber, and maybe some ID (monotonic increasing int for ex) or name.
You would place for example:
PROBE(0,"EventProcessing",__FILE__,__LINE__)
...... //process event
PROBE(1,"EventProcessing",__FILE__,__LINE__)
......//different processing on same event
PROBE(2,"EventProcessing",__FILE__,__LINE__)
The PROBE could be some macro that populates a struct containing this data (maybe on an array/list, etc using the id as an indexer).
Would it be possible to have code in the init section that could prepopulate all of this data for each PROBE (except for the time of course), so only the time would need to be retrieved/copied at runtime?
As far as I know the __attribute__((constructor)) can not be applied to member functions?
My initial idea was to create some kind of
linked list with each node pointing to each probe and code in the init secction could iterate it populating the id, file, line, etc, but
that idea assumed I could use a member function that could run in the "init" section, but that does not seem possible. Any tips appreciated!

As far as I understand it, you do not actually need an ELF constructor here. Instead, you could emit descriptors for your probes using extended asm statements (using data, instead of code). This also involves switching to a dedicated ELF section for the probe descriptors, say __probes.
The linker will concatenate all the probes and in an array, and generate special symbols __start___probes and __stop___probes, which you can use from your program to access thes probes. See the last paragraph in Input Section Example.
Systemtap implements something quite similar for its userspace probes:
User Space Probe Implementation
Adding User Space Probing to an Application (heapsort example)
Similar constructs are also used within the Linux kernel for its self-patching mechanism.

There's a pretty simple way to have code run on module load time: Use the constructor of a global variable:
struct RunMeSomeCode
{
RunMeSomeCode()
{
// your code goes here
}
} do_it;
The .init/.fini sections basically exist to implement global constructors/destructors as part of the ABI on some platforms. Other platforms may use different mechanisms such as _start and _init functions or .init_array/.deinit_array and .preinit_array. There are lots of subtle differences between all these methods and which one to use for what is a question that can really only be answered by the documentation of your target platform. Not all platforms use ELF to begin with…
The main point to understand is that things like the .init/.fini sections in an ELF binary happen way below the level of C++ as a language. A C++ compiler may use these things to implement certain behavior on a certain target platform. On a different platform, a C++ compiler will probably have to use different mechanisms to implement that same behavior. Many compilers will give you tools in the form of language extensions like __attributes__ or #pragmas to control such platform-specific details. But those generally only make sense and will only work with that particular compiler on that particular platform.

You don't need a member function (which gets a this pointer passed as an arg); instead you can simply create constructor-like functions that reference a global array, like
#define PROBE(id, stuff, more_stuff) \
__attribute__((constructor)) void \
probeinit##id(){ probes[id] = {id, stuff, 0/*to be written later*/, more_stuff}; }
The trick is having this macro work in the middle of another function. GNU C / C++ allows nested functions, but IDK if you can make them constructors.
You don't want to declare a static int dummy#id = something because then you're adding overhead to the function you profile. (gcc has to emit a thread-safe run-once locking mechanism.)
Really what you'd like is some kind of separate pass over the source that identifies all the PROBE macros and collects up their args to declare
struct probe global_probes[] = {
{0, "EventName", 0 /*placeholder*/, filename, linenum},
{1, "EventName", 0 /*placeholder*/, filename, linenum},
...
};
I'm not confident you can make that happen with CPP macros; I don't think it's possible to #define PROBE such that every time it expands, it redefines another macro to tack on more stuff.
But you could easily do that with an awk/perl/python / your fave scripting language program that scans your program and constructs a .c that declares an array with static storage.
Or better (for a single-threaded program): keep the runtime timestamps in one array, and the names and stuff in a separate array. So the cache footprint of the probes is smaller. For a multi-threaded program, stores to the same cache line from different threads is called false sharing, and creates cache-line ping-pong.
So you'd have #define PROBE(id, evname, blah blah) do { probe_times[id] = now(); }while(0)
and leave the handling of the later args to your separate preprocessing.

Is it possible to create a user-defined datatype in a language like C/C++(or maybe any) from a string as user input or from file

Well this might be a very weird question but my curiosity has striken pretty hard on this. So here it goes...
NOTE: Lets take the language C into consideration here.
As programmers we usually define a user-defined datatype(say struct) in the source code with the appropriate name.
Suppose I have a program in which I have a structure defined as:
struct Animal {
char *name;
int lifeSpan;
};
And also I have started the execution of this program.
Now, my question here is;
What if I want to define a new structure called "Plant" just like "Animal" mentioned above in my program, without writing its definition in the source code itself(which is obviously impossible currently) but rather from a user input string(or a file input) during runtime.
Lets say my program takes input string from a text file named file1.txt whose content is:
struct Plant {
char *name;
int lifeSpan;
};
What I want now is to have a new structure named "Plant" in my program which is already in execution. The program should read the file content and create a structure as written in the file and attach it to itself on-the-go.
I have checked out a solution for C++ in the discussion Declaring a data type dynamically in C++ but it doesnt seem to have a very convincing solution.
The solution I am looking for is at the compiler-linker-loader level rather than from the language itself.I would be very pleased and thankful if anyone is looking forward to sharing their ideas on this.

What you're asking about is basically "can we implement C as a scripting language?", since this is the only way code can be executed after compilation.
I'm aware that people have been writing (mostly in the comments) that it's possible in other languages but isn't possible in C, since C is a compiled language (hence data types should be defined during compile time).
However, to the best of my knowledge it's actually possible (and might not be as hard as one would imagine).
There are many possible approaches (machine code emulation (VM), JIT compilation, etc').
One approach will use a C compiler to compile the C script as an external dynamic library (.dll on windows, .so on linux, etc') and than "load" the compiled library and execute the code (this is pretty much the JIT compilation approach, for lazy people).
EDIT:
As mentioned in the comments, by using this approach, the new type is loaded as part of an external library.
The original code won't know about this new type, only the new code (or library) will be "aware" of this new type and able to properly use it.
On the other hand, I'm not sure why you're insisting on the need to use static types and a compiler-linker-loader level solution.
The language itself (the C language) can manage this task dynamically (during execution time).
Consider Ruby MRI, for example. The Ruby language supports dynamic types that can be defined during runtime...
...However, this is implemented in C and it's possible to use the code from within C to define new modules and classes. These aren't static types that can be tested during compilation (type creation and identification is performed during runtime).
This is a perfect example showing that C (as a language) can dynamically define "types".
However, this is also a poor example because Ruby's approach is slow. A custom approved can be far faster since it would avoid the huge overhead related to functionality you might not need (such as inheritance).

Performance penalty for large C++ dll's with autogenerated C code

I am working on a piece of software that needs to call a family of optimisation solvers. Each solver is an auto-generated piece of C code, with thousands of lines of code. I am using 200 of these solvers, differing only in the size of optimisation problem to be solved.
All-in-all, these auto-generated solvers come to about 180MB of C code, which I compile to C++ using the extern "C"{ /*200 solvers' headers*/ } syntax, in Visual Studio 2008. Compiling all of this is very slow (with the "maximum speed /O2" optimisation flag, it takes about 8hours). For this reason I thought it would be a good idea to compile the solvers into a single DLL, which I can then call from a separate piece of software (which would have a reasonable compile time, and allow me to abstract away all this extern "C" stuff from higher-level code). The compiled DLL is then about 37MB.
The problem is that when executing one of these solvers using the DLL, execution requires about 30ms. If I were to compile only that single one solvers into a DLL, and call that from the same program, execution is about 100x faster (<1ms). Why is this? Can I get around it?
The DLL looks as below. Each solver uses the same structures (i.e. they have the same member variables), but they have different names, hence all the type casting.
extern "C"{
#include "../Generated/include/optim_001.h"
#include "../Generated/include/optim_002.h"
/*etc.*/
#include "../Generated/include/optim_200.h"
}
namespace InterceptionTrajectorySolver
{
__declspec(dllexport) InterceptionTrajectoryExitFlag SolveIntercept(unsigned numSteps, InputParams params, double* optimSoln, OutputInfo* infoOut)
{
int exitFlag;
switch(numSteps)
{
case 1:
exitFlag = optim_001_solve((optim_001_params*) &params, (optim_001_output*) optimSoln, (optim_001_info*) &infoOut);
break;
case 2:
exitFlag = optim_002_solve((optim_002_params*) &params, (optim_002_output*) optimSoln, (optim_002_info*) &infoOut);
break;
/*
...
etc.
...
*/
case 200:
exitFlag = optim_200_solve((optim_200_params*) &params, (optim_200_output*) optimSoln, (optim_200_info*) &infoOut);
break;
}
return exitFlag;
};
};

I do not know if your code is inlined into each case part in the example. If your functions are inline functions and you are putting it all inside one function then it will be much slower because the code is laid out in virtual memory, which will require much jumping around for the CPU as the code is executed. If it is not all inlined then perhaps these suggestions might help.
Your solution might be improved by...
A)
1) Divide the project into 200 separate dlls. Then build with a .bat file or similar.
2) Make the export function in each dll called "MyEntryPoint", and then use dynamic linking to load in the libraries as they are needed. This will then be the equivalent of a busy music program with a lot of small dll plugins loaded. Take a function pointer to the EntryPoint with GetProcAddress.
Or...
B) Build each solution as a separate .lib file. This will then compile very quickly per solution and you can then link them all together. Build an array of function pointers to all the functions and call it via lookup instead.
result = SolveInterceptWhichStep;
Combine all the libs into one big lib should not take eight hours. If it takes that long then you are doing something very wrong.
AND...
Try putting the code into different actual .cpp files. Perhaps that specific compiler will do a better job if they are all in different units etc... Then once each unit has been compiled it will stay compiled if you do not change anything.

Make sure that you measure and average the timing multiple calls to the optimizer, because it could be that there's a large overhead to the setup before the first call.
Then also check what that 200-branch conditional statement (your switch) is doing to your performance! Try eliminating that switch for testing, calling just one solver in your test project but linking all of them in the DLL. Do you still see slow performance?

I assume the reason you are generating the code is for better run-time performance, and also for better correctness.
I do the same thing.
I suggest you try this technique to find out what the run-time performance problem is.
If you're seeing a 100:1 performance difference, that means each time you interrupt it and look at the program's state, there is a 99% chance you will see what the problem is.
As far as build time goes, sure it makes sense to modularize it.
None of that should have much effect on run time, unless it means you're doing crazy I/O.

memory address of a member variable of argument objects changes when dll function is called

class SomeClass
{
//some members
MemberClass one_of_the_mem_;
}
I have a function foo( SomeClass *object ) within a dll, it is being called from an exe.
Problem
address of one_of_the_mem_ changes during the time the dll call is dispatched.
Details:
before the call is made (from exe):
'&(this).one_of_the_mem_' - `0x00e913d0`
after - in the dll itself :
'&(this).one_of_the_mem_' - `0x00e913dc`
The address of object remains constant. It is only the member whose address shift by c every time.
I want some pointers regarding how can I troubleshoot this problem.
Code :
Code from Exe
stat = module->init ( this,
object_a,
&object_b,
object_c,
con_dir
);
Code in DLL
Status_C ModuleClass( SomeClass *object, int index, Config *conf, const char* name)
{
_ASSERT(0); //DEBUGGING HOOK
...
Update 1:
I compared the Offsets of members following Michael's instruction and they are the same in both cases.
Update 2:
I found a way to dump the class layout and noticed the difference in size, I have to figure out why is that happening though.
linked is the question that I found to dump class layout.
Update 3:
Final Update : Solved the problem, much thanks to Michael Burr.
it turned out that one of the build was using 32 bit time, _USE_32BIT_TIME_T was defined in it and the other one was using 64 bit time. So it generated the different layout for the object, attached is the difference file.

Your DLL was probably compiled with different set of compiler options (or maybe even a slightly different header file) and the class layout is different as a result.
For example, if one was built using debug flags and other wasn't or even if different compiler versions were used. For example, the libraries used by different compiler versions might have subtle differences and if your class incorporates a type defined by the library you could have different layouts.
As a concrete example, with Microsoft's compiler iterators and containers are sensitive to release/debug, _SECURE_SCL on/off , and _HAS_ITERATOR_DEBUGGING on/off setting (at least up though VS 2008 - VS 2010 may have changed some of this to a certain extent). See http://connect.microsoft.com/VisualStudio/feedback/details/352699/secure-scl-is-broken-in-release-builds for some details.
These kinds of issues make using C++ classes across DLL boundaries a bit more fragile than using straight C interfaces. They can occur in C structures as well, but it seems like C++ libraries have these differences more often (I think that's the nature of having richer functionality).
Another layout-changing issue that occurs every now and then is having a different structure packing option in effect in the different compiles. One thing that can 'hide' this is that pragmas are often used in headers to set structure packing to a certain value, and sometimes you may come across a header that does this without changing it back to the default (or more correctly the previous setting). If you have such a header, it's easy to have it included in the build for one module, but not another.

that sounds a bit wierd, you should show more code, it should 'move' if it being passed by ref, it sounds more like a copy of it is being made and that having the member function called.
Perhaps the DLL versions is compiled against a different version that you are referencing. check and make sure the header file is for the same version as the dll.
Recompile the library if you can.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to use CUDA constant memory in a programmer pleasant way? - c++

Related

if we had a single file project that contained all the code can we not use the linker?

ELF INIT section code to prepopulate objects used at runtime

Is it possible to create a user-defined datatype in a language like C/C++(or maybe any) from a string as user input or from file

Performance penalty for large C++ dll's with autogenerated C code

memory address of a member variable of argument objects changes when dll function is called

Categories

Resources