Halide extern methods - C++

I use AOT compilation so that I can use Halide code without the Halide libraries.
I see in HalideRuntime.h (available in the sources) that many extern functions are available in my .o files.
halide_dev_malloc and halide_dev_free are very interesting. I already use halide_copy_to_dev without problems, but I noticed that it allocates my device memory for me.
If I want to allocate with halide_dev_malloc myself and then do a simple memcpy between host and device, is that possible?
Does HalideRuntime.h list all the available extern functions, or do the object files contain a lot of others?
Jay

HalideRuntime.h is intended to document all the routines that can be called or replaced by clients. There are many other symbols in the runtime, but they should be considered internal. We recently moved these other routines into their own namespace to indicate that they are internal.
The runtime for device backends is still a work in progress, and there will be an improved design intended to allow more flexibility and to let code do more while still working generically across multiple backends. At present, halide_dev_malloc will allocate the device handle for whichever device backend is selected via the Target at Halide compile time. However, this handle is backend specific, so in order to do anything with it you must know which backend is in use and how that backend interacts with the device API. E.g., in order to use the handle with memcpy, you need to know that the device backend supports some sort of uniform memory architecture ("Unified Virtual Address Space" in CUDA terminology) and that the device memory was allocated with the right API calls to produce a buffer that can be accessed from both the device and the CPU through the same pointer, etc. Depending on which backend you are using and which platform you are on, that may or may not work at present. (Uniform memory designs are a fairly recent thing for the most part; we haven't put a lot of effort into supporting them.)
For CUDA/PTX, halide_dev_malloc calls cuMemAlloc, and I think the allocation may be within the Unified Virtual Address Space on many systems by default, but I am not sure.

Yes, you can use halide_dev_malloc and copy things yourself manually. See https://github.com/halide/Halide/blob/master/src/runtime/cuda.cpp line 466 for what halide_copy_to_dev actually does.
First it does a halide_dev_malloc, and then it uses CUDA's cuMemcpyHtoD. There's a bunch of extra logic there for the case where the buffer isn't dense in memory, but most of the time it turns into a single cuMemcpyHtoD.
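A rough sketch of doing the copy yourself, assuming the buffer_t-based runtime of this era and the CUDA backend (entry-point names differ in newer Halide versions, which use halide_device_malloc and halide_buffer_t, and Halide's CUDA context must be current on the calling thread):

    // Sketch only: allocate the device handle via the Halide runtime, then copy
    // host -> device ourselves with the CUDA driver API instead of halide_copy_to_dev.
    #include <cuda.h>
    #include "HalideRuntime.h"

    int upload_dense(buffer_t *buf, size_t size_in_bytes) {
        // Fills in buf->dev; under the CUDA backend this ends up in cuMemAlloc.
        int err = halide_dev_malloc(NULL /*user_context*/, buf);
        if (err != 0) return err;
        // Only valid if the buffer is dense in memory (no strided copies needed).
        CUresult r = cuMemcpyHtoD((CUdeviceptr)buf->dev, buf->host, size_in_bytes);
        return (r == CUDA_SUCCESS) ? 0 : -1;
    }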
I believe HalideRuntime.h contains all the useful extern functions. There are a few other internal ones like halide_create_cuda_context that might conceivably be interesting. To see all of them, look for functions marked as WEAK that start with the name halide_ in this folder: https://github.com/halide/Halide/tree/master/src/runtime
Or you can run nm on a halide-generated object file and see all the symbols that start with halide_.

Related

Sharing pointer across programs in C++

This is related to a previous post:
Allocating a large memory block in C++
I would like a single C++ server running that generates a giant matrix M. Then, on the same machine, I would like to run other programs that can contact this server and get the memory address of M. M is read-only, and the server creates it once. I should be able to spawn a client ./test, and this program should be able to get read-only access to M. The server should always be running, but I can run other programs like ./test at any time.
I don't know much about C++ or operating systems; what is the best way to do this? Should I use POSIX threads? The matrix is of a primitive type (double, float, etc.), and all programs know its dimensions. The client programs require the entire matrix, so I don't want the latency of a memory copy from the server to the client; I just want to share that pointer directly. What are my best options?
One mechanism of inter-process communication you could definitely use for sharing direct access to your matrix M is shared memory. It means that the OS lets multiple processes access a shared segment of memory as if it were in their own address space, by mapping it into each process that requests it. A solution that answers all your requirements, and is also cross-platform, is boost::interprocess. It is a thin portable layer that wraps all of the necessary OS calls. See a working example right here in the docs.
Essentially, your server process just needs to create an object of type boost::interprocess::shared_memory_object, providing the constructor with a name for the shared segment. When you call its truncate() method, the OS will find a large enough segment in the address space of this server process. From that moment, any other process can create an object of the same type and provide the same name. Now it too has access to the exact same memory. No copies involved.
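A minimal sketch of that flow (the segment name "matrix_shm" is made up; error handling and cleanup with shared_memory_object::remove are omitted):

    #include <boost/interprocess/shared_memory_object.hpp>
    #include <boost/interprocess/mapped_region.hpp>
    #include <cstring>
    namespace bip = boost::interprocess;

    // Server: create the named segment, size it, and fill it with the matrix.
    void publish_matrix(const double *m, std::size_t n_bytes) {
        bip::shared_memory_object shm(bip::create_only, "matrix_shm", bip::read_write);
        shm.truncate(n_bytes);                          // reserve the segment
        bip::mapped_region region(shm, bip::read_write);
        std::memcpy(region.get_address(), m, n_bytes);  // one-time fill by the server
    }

    // Client (./test): map the same segment read-only; no copy of M is made.
    void read_matrix() {
        bip::shared_memory_object shm(bip::open_only, "matrix_shm", bip::read_only);
        bip::mapped_region region(shm, bip::read_only);
        const double *m = static_cast<const double *>(region.get_address());
        // ... use m for as long as `region` stays alive ...
        (void)m;
    }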
If for some reason you are unable to use the portable Boost libraries, or otherwise want to restrict the supported platform to Linux, use the POSIX API around the mmap() function. Here's the Linux man page. Usage is basically not far from the Boost pipeline described above: you create the named segment with shm_open() and size it with ftruncate(). From there onwards you obtain the mapped pointer to this allocated space by calling mmap(). In simpler cases where you'll only be sharing between parent and child processes, you can use this code example from this very website.
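The same idea with the raw POSIX calls looks roughly like this (Linux; error checking omitted, and older glibc needs -lrt for shm_open):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    // Server side: create the named segment and size it for the matrix.
    double *create_matrix(const char *name, size_t n_bytes) {
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        ftruncate(fd, n_bytes);
        void *p = mmap(nullptr, n_bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                                   // the mapping stays valid after close
        return static_cast<double *>(p);
    }

    // Client side: open the existing segment and map it read-only.
    const double *open_matrix(const char *name, size_t n_bytes) {
        int fd = shm_open(name, O_RDONLY, 0);
        void *p = mmap(nullptr, n_bytes, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);
        return static_cast<const double *>(p);
    }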
Of course, no matter what approach you take, when using a shared resource, make sure to synchronize reads/writes properly so as to avoid any race conditions -- as you would have done in the scenario of multiple threads of the same process.
Of course other programs cannot access the matrix as long as it is in the "normal" process memory.
Without questioning the design approach: yes, you have to use shared memory. Look up functions like shmget(), shmat(), etc. Then you don't need to pass the pointer to another program (which would not work, actually); you simply pass the same file to ftok() everywhere to get access to the shared memory.

Allocation numbers in C++ (Windows) and their predictability

I am using _CrtDumpMemoryLeaks to identify memory leaks in our software. We are using a third-party library in a multi-threaded application. This library does have memory leaks, and therefore in our tests we want to identify those that are ours and discard those we do not have any control over.
We use continuous integration so new functions/algorithms/bug fixes get added all the time.
So the question is: is there a safe way of identifying the leaks that are ours versus those from the third-party library? We thought about using allocation numbers, but is that safe?
In a big application I worked on, the global new and delete operators were overridden (e.g., see How to properly replace global new & delete operators) and used private heaps (e.g., HeapCreate). Third-party libraries would use the process heap, and thus the allocations would be clearly separated.
Frankly, I don't think you can get far with allocation numbers. Using explicit separate heaps for the app and the libraries (and maybe even separate per-component heaps within your own app) would be much more manageable. Consider that you can add your own app-specific header to each allocated block and thus enable very fancy memory tracking. For example, capturing the entire call stack of each allocation would be possible, for debugging. Or per-component accounting. Etc., etc.
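A bare-bones sketch of the private-heap idea (Windows only; in practice you would also replace operator new[]/delete[] and the nothrow variants, and attach your tracking header to each block before returning it):

    #include <windows.h>
    #include <new>

    // One private heap for "our" code; the third-party library keeps using the
    // default process heap, so the two allocation populations never mix.
    static HANDLE app_heap() {
        static HANDLE h = HeapCreate(0, 0, 0);
        return h;
    }

    void *operator new(size_t size) {
        void *p = HeapAlloc(app_heap(), 0, size);
        if (!p) throw std::bad_alloc();
        return p;
    }

    void operator delete(void *p) noexcept {
        if (p) HeapFree(app_heap(), 0, p);
    }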
You might be able to do this using Microsoft's heap debugging library without using any third-party solutions. Based on what I learned from a previous question here, you should just make sure that all memory allocated in your code is allocated through a call to _malloc_dbg where the second argument is set to _CLIENT_BLOCK. Then you can set a callback function with _CrtSetDumpClient, and that callback will only receive information about the client blocks that were allocated, not the other ones.
You can easily use the preprocessor to convert all the calls to malloc and free to actually call their debugging versions (e.g. _malloc_dbg); just look at how it's done in crtdbg.h which comes with Visual Studio.
The tricky part for me would be figuring out how to override the new and delete operators to call debugging functions like _malloc_dbg. It might be hard to find a solution where only the news and deletes in your own code are affected, and not in the third-party library.
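For the CRT route, a hedged sketch of the _CLIENT_BLOCK / _CrtSetDumpClient combination (MSVC debug builds only; my_alloc is a made-up wrapper standing in for however your code routes its allocations):

    #include <crtdbg.h>
    #include <cstdio>

    // Allocate our own memory as _CLIENT_BLOCKs...
    void *my_alloc(size_t size) {
        return _malloc_dbg(size, _CLIENT_BLOCK, __FILE__, __LINE__);
    }

    // ...so that the dump hook is invoked only for our blocks, not the
    // _NORMAL_BLOCKs coming from the third-party library.
    void __cdecl client_dump(void *ptr, size_t size) {
        std::printf("our leaked block: %p (%zu bytes)\n", ptr, size);
    }

    int main() {
        _CrtSetDumpClient(client_dump);
        my_alloc(64);                 // deliberately leaked for the demo
        _CrtDumpMemoryLeaks();
        return 0;
    }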
You may want to use the DebugDiag tool provided by Microsoft. For complete information about the tool, see http://www.microsoft.com/en-sg/download/details.aspx?id=40336
DebugDiag can be used for identifying various issues. To track down the leaks (ours and a third-party module's), the steps are:
Configure DebugDiag under the rule type "Native (non-.NET) Memory and Handle Leak".
Re-run the application for some time and capture dump files. You can also configure DebugDiag to capture a dump file after a specified interval.
Open/analyze the captured dump file in DebugDiag under the "Performance Analyzers".
Once the analysis is complete, DebugDiag automatically generates a report giving the modules/DLLs where a leak is likely (with probability). Once we have the module information from DebugDiag, we can concentrate on that particular module by doing static code analysis. If the module belongs to a third-party DLL, we can share the DebugDiag report with its vendor. In addition, if you run/attach your application with the appropriate PDB files, DebugDiag also provides the call stack from which the memory leak likely originates.
This information was very useful in the past while debugging memory leaks in Windows-based applications. Hopefully the above is useful to you too.
The answer REALLY depends on the actual implementation of the third-party library. Does it only leak a consistent number of items, or does that depend on, for example, the number of threads, which functions are used within the library, or some such? When are the allocations made?
Even if it's a consistent number of leaks regardless of library usage, I'd be hesitant to rely on the allocation number. By all means, give it a try. If all the allocations are made very early on, and they don't depend on any of "your" code, then it could work - and it is a REALLY simple thing. But try adding, for example, a static std::vector<int>(100) to see if memory allocations in static variables affect the allocation numbers... If they do, this method is probably doomed (unless you have very strict rules on static objects).
Using a separate heap (with the new/delete operators replaced) would be the correct solution, as this can probably be expanded to gather other statistics too [like the number of allocations made, to detect parts of the code that make excessive allocations - of course, this has to be analysed based on what the code actually does].
The newer Doug Lea mallocs include the mspace abstraction. An mspace is a separate heap. In our application of a couple hundred thousand NCSL, we use a dozen different mspaces for different parts of the code. We use custom STL allocators (see the sketch below) to have STL containers allocate memory from the right mspace.
Some of the benefits:
Third-party code does not use mspaces, so its allocations (and leaks) do not mix with ours.
We can look at the memory usage of each mspace to see which piece of code might have memory leaks.
Any memory corruption is contained within one mspace, limiting the amount of code we need to look at for debugging.
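For illustration, a sketch of wiring an STL container to one mspace (assuming dlmalloc built with MSPACES=1; the declarations below mirror the malloc-2.8.x API, and the component name is made up):

    #include <cstddef>
    #include <vector>

    extern "C" {                       // from dlmalloc built with MSPACES=1
        typedef void *mspace;
        mspace create_mspace(size_t capacity, int locked);
        void  *mspace_malloc(mspace msp, size_t bytes);
        void   mspace_free(mspace msp, void *mem);
    }

    // One heap just for the (hypothetical) parser component.
    static mspace g_parser_space = create_mspace(0, /*locked=*/1);

    template <class T>
    struct ParserAllocator {
        using value_type = T;
        ParserAllocator() = default;
        template <class U> ParserAllocator(const ParserAllocator<U> &) {}
        T *allocate(std::size_t n) {
            return static_cast<T *>(mspace_malloc(g_parser_space, n * sizeof(T)));
        }
        void deallocate(T *p, std::size_t) { mspace_free(g_parser_space, p); }
    };
    template <class T, class U>
    bool operator==(const ParserAllocator<T> &, const ParserAllocator<U> &) { return true; }
    template <class T, class U>
    bool operator!=(const ParserAllocator<T> &, const ParserAllocator<U> &) { return false; }

    // All of this vector's storage now comes from (and leaks show up in) g_parser_space.
    std::vector<int, ParserAllocator<int>> parser_ints;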

Accessing >2,3,4GB Files in 32-bit Process on 64-bit (or 32-bit) Windows

Disclaimer: I apologize for the verbosity of this question (I think it's an interesting problem, though!), but I cannot figure out how to word it more concisely.
I have done hours of research into the apparent myriad of ways to solve the problem of accessing multi-GB files in a 32-bit process on 64-bit Windows 7, ranging from /LARGEADDRESSAWARE to VirtualAllocEx and AWE. I am somewhat comfortable with writing a multi-view memory-mapped system in Windows (CreateFileMapping, MapViewOfFile, etc.), yet I can't quite escape the feeling that there is a more elegant solution to this problem. Also, I'm quite aware of Boost's interprocess and iostream templates, although they appear to be rather lightweight, requiring a similar amount of effort to writing a system utilizing only Windows API calls (not to mention the fact that I already have a memory-mapped architecture semi-implemented using Windows API calls).
I'm attempting to process large datasets. The program depends on pre-compiled 32-bit libraries, which is why, for the moment, the program itself also runs as a 32-bit process, even though the machine and OS are 64-bit. I know there are ways I could add wrapper libraries around this, but seeing as it's part of a larger codebase, that would indeed be a bit of an undertaking. I set the binary headers to allow /LARGEADDRESSAWARE (at the expense of decreasing my kernel space?), so I get up to around 2-3 GB of addressable memory per process, give or take (depending on heap fragmentation, etc.).
Here's the issue: the datasets are 4+ GB, and the DSP algorithms that run on them require essentially random access across the file. A pointer to the object generated from the file is handled in C#, yet the file itself is loaded into memory (with this partial memory-mapped system) in C++ (it's P/Invoked). Thus, I believe the solution is unfortunately not as simple as adjusting the windowing to access the portion of the file I need: essentially, I still want the entire file abstracted behind a single pointer, from which I can call methods to access data almost anywhere in the file.
Apparently, most memory-mapped architectures rely upon splitting the single process into multiple processes. So, for example, I'd access a 6 GB file with three processes, each holding a 2 GB window into the file. I would then need to add a significant amount of logic to pull and recombine data from across these different windows/processes. VirtualAllocEx apparently provides a method of increasing the virtual address space, but I'm still not entirely sure if this is the best way of going about it.
But let's say I want this program to function just as "easily" as a single 64-bit process on a 64-bit system. Assume that I don't care about thrashing; I just want to be able to manipulate a large file on the system, even if only, say, 500 MB of it were loaded into physical RAM at any one time. Is there any way to obtain this functionality without having to write a somewhat ridiculous, manual memory system by hand? Or is there some better way than what I have found by combing SO and the internet thus far?
This lends itself to a secondary question: is there a way of limiting how much physical RAM would be used by this process? For example, what if I wanted to limit the process to only having 500 MB loaded into physical RAM at any one time (whilst keeping the multi-GB file paged on disk)?
I'm sorry for the long question, but I feel as though it's a decent summary of what appear to be many questions (with only partial answers) that I've found on SO and the net at large. I'm hoping that this can be an area wherein a definitive answer (or at least some pros/cons) can be fleshed out, and we can all learn something valuable in the process!
You could write an accessor class to which you give a base address and a length. It returns data, or throws an exception (or however else you want to signal error conditions) if error conditions arise (out of bounds, etc.).
Then, any time you need to read from the file, the accessor object can use SetFilePointerEx() before calling ReadFile(). You can pass the accessor class to the constructor of whatever objects you create when you read the file. The objects then use the accessor class to read the data from the file and pass it to their constructors, which parse it into object data.
If, later down the line, you're able to compile to 64-bit, you can just change (or extend) the accessor class to read from memory instead.
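A minimal sketch of such an accessor (Win32; error reporting reduced to a bool; a 64-bit variant would simply read from a mapped or in-memory copy instead):

    #include <windows.h>
    #include <cstdint>

    class FileAccessor {
        HANDLE file_;
    public:
        explicit FileAccessor(HANDLE file) : file_(file) {}

        // Read `length` bytes starting at absolute byte `offset` into `dest`.
        bool read(std::uint64_t offset, void *dest, std::uint32_t length) const {
            LARGE_INTEGER pos;
            pos.QuadPart = static_cast<LONGLONG>(offset);
            if (!SetFilePointerEx(file_, pos, nullptr, FILE_BEGIN))
                return false;
            DWORD got = 0;
            return ReadFile(file_, dest, length, &got, nullptr) && got == length;
        }
    };

Objects built from the file take a FileAccessor in their constructor and pull only the byte ranges they need, so the whole 4+ GB file never has to fit in the 32-bit address space at once.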
As for limiting the amount of RAM used by the process: that's mostly a matter of making sure that
A) you don't have memory leaks (especially obscene ones), and
B) you destroy objects you don't need at that very moment. Even if you will need an object later down the line and the data won't change, just destroy it; then recreate it later when you do need it, allowing it to re-read the data from the file.

Is it safe to send a pointer to a static function over the network?

I was thinking about some RPC code that I have to implement in C++, and I wondered whether it's safe (and under which assumptions) to send a pointer to a static function over the network to the same binary (assuming the binary is exactly the same on both ends and they are running on the same architecture). I guess virtual memory is what would make the difference here.
I'm asking just out of curiosity, since it's a bad design in any case, but I would like to know if it's theoretically possible (and whether it extends to other kinds of pointers to static data, beyond functions, that the program may include).
In general, it's not safe for many reasons, but there are limited cases in which it will work. First of all, I'm going to assume you're using some sort of signing or encryption in the protocol that ensures the integrity of your data stream; if not, you have serious security issues already that are only compounded by passing around function pointers.
If the exact same program binary is running on both ends of the connection, if the function is in the main program (or in code linked from a static library) and not in a shared library, and if the program is not built as a position-independent executable (PIE), then the function pointer will be the same on both ends and passing it across the network should work. Note that these are very stringent conditions that would have to be documented as part of using your program, and they're very fragile; for instance if somebody upgrades the software on one side and forgets to upgrade the version on the other side of the connection at the same time, things will break horribly and dangerously.
I would avoid this type of low-level RPC entirely in favor of a higher-level command structure or abstract RPC framework, but if you really want to do it, a slightly safer approach would be to pass function names and use dlsym or equivalent to look them up. If the symbols reside in the main program binary rather than libraries, then depending on your platform you might need -rdynamic (GCC) or a similar option to make them available to dlsym. libffi might also be a useful tool for abstracting this.
Also, if you want to avoid depending on dlsym or libffi, you could keep your own "symbol table" hard-coded in the binary as a static const linear table or hash table mapping symbol names to function pointers. The hash table format used in ELF for this purpose is very simple to understand and implement, so I might consider basing your implementation on that.
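Roughly, the name-based approach could look like this (POSIX; link with -ldl, and with -rdynamic if the target functions live in the main executable; the function names and the signature are invented for the example):

    #include <dlfcn.h>
    #include <string>

    typedef void (*rpc_fn)(const char *args);

    // Option 1: resolve a name received over the wire with dlsym.
    rpc_fn resolve(const std::string &name) {
        void *self = dlopen(nullptr, RTLD_NOW);                       // handle to the main program
        return reinterpret_cast<rpc_fn>(dlsym(self, name.c_str()));   // null if unknown
    }

    // Option 2 (safer): a hard-coded whitelist, so only known entry points are callable.
    extern "C" void do_ping(const char *)  { /* ... */ }
    extern "C" void do_stats(const char *) { /* ... */ }

    struct Entry { const char *name; rpc_fn fn; };
    static const Entry k_table[] = { {"ping", do_ping}, {"stats", do_stats} };

    rpc_fn resolve_whitelisted(const std::string &name) {
        for (const Entry &e : k_table)
            if (name == e.name) return e.fn;
        return nullptr;
    }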
What is it a pointer to?
Is it a pointer to a piece of static program memory? If so, don't forget that it's an address, not an offset, so you'd first need to convert between the two accordingly.
Second, if it's not a piece of static memory (i.e., a statically allocated array created at build time as opposed to run time), it's not really possible at all.
Finally, how are you ensuring the two pieces of code are the same? Are both binaries bit-identical (e.g., diff -a binary1 binary2)? Even if they are bit-identical, depending on the virtual memory management on each machine, the entire program's memory segment may not live in a single page, or the alignment across multiple pages may be different on each system.
This is really a bad idea, no matter how you slice it. This is what message passing and APIs are for.
I don't know of any form of RPC that will let you send a pointer over the network (at least without doing something like casting to int first). If you do convert to int on the sending end, and convert that back to a pointer on the far end, you get pretty much the same as converting any other arbitrary int to a pointer: undefined behavior if you ever attempt to dereference it.
Normally, if you pass a pointer to an RPC function, it'll be marshalled -- i.e., the data it points to will be packaged up, sent across, put into memory, and a pointer to that local copy of the data passed to the function on the other end. That's part of why/how IDL gets a bit ugly -- you need to tell it how to figure out how much data to send across the wire when/if you pass a pointer. Most know about zero-terminated strings. For other types of arrays, you typically need to specify the size of the data (somehow or other).
This is highly system dependent. On systems with virtual addressing such that each process thinks it's running at the same address each time it executes, this could plausibly work for executable code. Darren Kopp's comment and link regarding ASLR is interesting - a quick read of the Wikipedia article suggests the Linux & Windows versions focus on data rather than executable code, except for "network facing daemons" on Linux, and on Windows it applies only when "specifically linked to be ASLR-enabled".
Still, "same binary code" is best assured by static linking - if different shared objects/libraries are loaded, or they're loaded in different order (perhaps due to dynamic loading - dlopen - driven by different ordering in config files or command line args etc.) you're probably stuffed.
Sending a pointer over the network is generally unsafe. The two main reasons are:
Reliability: the data/function pointer may not point to the same entity (data structure or function) on another machine, due to a different location of the program, its libraries, or dynamically allocated objects in memory. Relocatable code + ASLR can break your design. At the very least, if you want to point to a statically allocated object or a function, you should send its offset w.r.t. the image base if your platform is Windows (sketched below, after this answer), or do something similar on whatever OS you are on.
Security: if your network is open and there's a hacker (or they have broken into your network), they can impersonate your first machine and make the second machine either hang or crash, causing a denial of service, or execute arbitrary code and get access to sensitive information or tamper with it or hijack the machine and turn it into an evil bot sending spam or attacking other computers. Of course, there are measures and countermeasures here, but...
If I were you, I'd design something different. And I'd ensure that the transmitted data is either unimportant or encrypted and the receiving part does the necessary validation of it prior to using it, so there are no buffer overflows or execution of arbitrary things.
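To illustrate the offset idea mentioned above: on Windows, a hedged sketch would transmit the function's displacement from the image base rather than the raw pointer (this still only works if the function lives in the same, identical module on both sides):

    #include <windows.h>
    #include <cstdint>

    // Sender: pointer -> offset from our EXE's load address.
    std::uint64_t to_offset(void (*fn)()) {
        auto base = reinterpret_cast<std::uintptr_t>(GetModuleHandle(nullptr));
        return reinterpret_cast<std::uintptr_t>(fn) - base;
    }

    // Receiver: offset -> pointer, relative to *its own* load address.
    void (*from_offset(std::uint64_t off))() {
        auto base = reinterpret_cast<std::uintptr_t>(GetModuleHandle(nullptr));
        return reinterpret_cast<void (*)()>(base + static_cast<std::uintptr_t>(off));
    }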
If you're looking for some formal guarantees, I cannot help you. You would have to look in the documentation of the compiler and OS that you're using - however, I doubt that you would find the necessary guarantees - except possibly for some specialized embedded-systems OSes.
I can however provide you with one scenario where I'm 99.99% sure that it will work without any problems:
Windows
32 bit process
Function is located in a module that doesn't have relocation information
The module in question is already loaded & initialized on the client side
The module in question is 100% identical on both sides
A compiler that doesn't do very crazy stuff (e.g. MSVC and GCC should both be fine)
If you want to call a function in a DLL you might run into problems. As per the list above, the module (=DLL) must not have relocation information, which of course makes it impossible to relocate it (which is what we need). Unfortunately, that also means that loading the DLL will fail if the "preferred load address" is used by something else. So that would be kind of risky.
If the function resides in the EXE, however, you should be fine. A 32-bit EXE doesn't need relocation information, and most don't include it (MSVC default settings). BTW: ASLR is not an issue here since a) ASLR only moves modules that are tagged as wanting to be moved, and b) ASLR could not move a 32-bit Windows module without relocation information even if it wanted to.
Most of the above just makes sure that the function will have the same address on both sides. The only remaining question - at least that I can think of - is: is it safe to call a function via a pointer that we initialized by memcpy-ing over some bytes that we received from the network, assuming that the byte-pattern is the same that we would have gotten if we had taken the address of the desired function? That surely is something that the C++ standard doesn't guarantee, but I don't expect any real-world problems from current real-world compilers.
That being said, I would not recommend doing this, except for situations where security and robustness really aren't important.

How can you track memory across DLL boundaries

I want performant run-time memory metrics, so I wrote a memory tracker based on overloading new & delete. It basically lets you walk your allocations in the heap and analyze everything about them - fragmentation, size, time, number, call stack, etc. But it has two fatal flaws: it can't track memory allocated in other DLLs, and when ownership of objects is passed to DLLs or vice versa, crashes ensue. And some smaller flaws: if a user uses malloc instead of new it's untracked, and likewise if a user defines class-specific new/delete operators.
How can I eliminate these flaws? I think I must be going about this fundamentally incorrectly by overloading new/delete; is there a better way?
The right way to implement this is to use detours and a separate tool that runs in its own process. The procedure is roughly the following:
Allocate memory in the remote process.
Place there the code of a small loader that will load your DLL.
Call the CreateRemoteThread API to run your loader (see the sketch below).
From inside the loaded DLL, establish detours (hooks, interceptors) on the alloc/dealloc functions.
Process the calls and track the activity.
If you implement your tool this way, it will not matter whether the memory allocation routines are called from a DLL or directly from the EXE. Plus, you can track activity in any process, not necessarily one you compiled yourself.
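For the loader/injection steps above, the classic shortcut is to use LoadLibraryA itself as the remote thread routine. A hedged sketch (same-bitness processes, error handling mostly omitted, and the tracker DLL is assumed to install its detours from DllMain):

    #include <windows.h>
    #include <cstring>

    bool inject_dll(HANDLE process, const char *dll_path) {
        // 1. Allocate memory in the remote process for the DLL path string.
        size_t len = std::strlen(dll_path) + 1;
        void *remote = VirtualAllocEx(process, nullptr, len, MEM_COMMIT, PAGE_READWRITE);
        if (!remote) return false;
        // 2. Copy the path into that memory.
        WriteProcessMemory(process, remote, dll_path, len, nullptr);
        // 3. Run LoadLibraryA(dll_path) in the remote process; the loaded DLL then
        //    hooks the alloc/dealloc functions.
        auto loader = reinterpret_cast<LPTHREAD_START_ROUTINE>(
            GetProcAddress(GetModuleHandleA("kernel32.dll"), "LoadLibraryA"));
        HANDLE thread = CreateRemoteThread(process, nullptr, 0, loader, remote, 0, nullptr);
        if (!thread) return false;
        WaitForSingleObject(thread, INFINITE);
        CloseHandle(thread);
        return true;
    }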
MS Windows allows checking the contents of the virtual address space of a remote process. You can summarize the use of the virtual address space collected this way in a histogram showing how many virtual allocations of what size exist in your target process.
For example, an overview of the virtual address space usage of the 32-bit MSVC DevEnv shows committed pieces of memory, reserved pieces, and unoccupied parts of the address space. The lower addresses are pretty fragmented, while the middle area is not; the various DLLs loaded into the process sit at the high addresses.
You should find out the common memory management routines that are called by new/delete and malloc/free, and intercept those. It is usually malloc/free in the end, but check to make sure.
On UNIX, I would use LD_PRELOAD with some library that re-implements those routines. On Windows, you have to hack a little bit, but this link seems to give a good description of the process. It basically suggests that you use Detours from Microsoft Research.
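On the UNIX side, the LD_PRELOAD interposer is only a few lines. A sketch (Linux/glibc, built with g++ -shared -fPIC -o libtrack.so track.cpp -ldl; note that real interposers need more care, because dlsym and fprintf can themselves allocate):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE               // for RTLD_NEXT
    #endif
    #include <dlfcn.h>
    #include <cstdio>
    #include <cstdlib>

    extern "C" void *malloc(size_t size) {
        static auto real_malloc =
            reinterpret_cast<void *(*)(size_t)>(dlsym(RTLD_NEXT, "malloc"));
        void *p = real_malloc(size);
        std::fprintf(stderr, "malloc(%zu) = %p\n", size, p);   // record/track here
        return p;
    }

    extern "C" void free(void *p) {
        static auto real_free =
            reinterpret_cast<void (*)(void *)>(dlsym(RTLD_NEXT, "free"));
        std::fprintf(stderr, "free(%p)\n", p);
        real_free(p);
    }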
Passing ownership of objects between modules is fundamentally flawed. It showed up with your custom allocator, but there are plenty of other cases that will fail also:
compiler upgrades, and recompiling only some DLLs
mixing compilers from different vendors
statically linking the runtime library
Just to name a few. Free every object from the same module that allocated it (often by exporting a deletion function, such as IUnknown::Release()).
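As a concrete sketch of that rule (hypothetical Widget API; the point is that new and delete both run inside the DLL, against the DLL's own CRT heap):

    // widget.h - shared between the DLL and its clients.
    #ifdef BUILDING_WIDGET_DLL
    #  define WIDGET_API __declspec(dllexport)
    #else
    #  define WIDGET_API __declspec(dllimport)
    #endif
    struct Widget;                                   // opaque outside the DLL
    extern "C" WIDGET_API Widget *widget_create();
    extern "C" WIDGET_API void    widget_destroy(Widget *w);

    // widget.cpp - compiled into the DLL.
    struct Widget { int data[16]; };
    Widget *widget_create()           { return new Widget(); }
    void    widget_destroy(Widget *w) { delete w; }

    // client.cpp - compiled into the EXE: never call delete on w yourself.
    void use_widget() {
        Widget *w = widget_create();
        // ... use w ...
        widget_destroy(w);            // freed by the module that allocated it
    }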