Boost.Coroutine not using segmented stacks

Can anyone give me an example of how to use segmented stacks with Boost coroutines? Do I have to annotate every function called from the coroutine with a special split-stack attribute?
When I try to write a program that should use segmented stacks, it just segfaults.
Here is what I have done so far:
https://wandbox.org/permlink/TltQwGpy4hRoHgDY The code segfaults after only 35 iterations; if segmented stacks were in use, I would expect it to handle far more.
#include <boost/coroutine2/all.hpp>
#include <iostream>
#include <array>

using std::cout;
using std::endl;

class Int {
    int a{2};
};

// Recurse forever, consuming roughly 1000 * sizeof(Int) of stack per frame.
void foo(int num) {
    cout << "In iteration " << num << endl;
    std::array<Int, 1000> arr;
    static_cast<void>(arr);
    foo(num + 1);
}

int main() {
    using Coroutine_t = boost::coroutines2::coroutine<int>::push_type;
    auto coro = Coroutine_t{[&](auto& yield) {
        foo(yield.get());
    }};
    coro(0);
}

Compiling that code with -fsplit-stack solves the problem. Annotations are not required: all functions are treated as split-stack by default. Example - https://wandbox.org/permlink/Pzzj5gMoUAyU0h7Q
Easy as that.
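For reference, the compile line boils down to something like this (my sketch; Boost.Coroutine2 itself is header-only but needs Boost.Context for the context switching):

g++ -std=c++14 -fsplit-stack main.cpp -lboost_context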

Compile Boost (Boost.Context and Boost.Coroutine) with the b2 property segmented-stacks=on (this enables special code inside Boost.Coroutine and Boost.Context).
Your app has to be compiled with -DBOOST_USE_SEGMENTED_STACKS and -fsplit-stack (required by the Boost.Coroutine headers).
See the documentation: http://www.boost.org/doc/libs/1_65_1/libs/coroutine/doc/html/coroutine/stack/segmented_stack_allocator.html
Boost.Coroutine contains an example that demonstrates segmented stacks (in directory coroutine/example/asymmetric/ call b2 toolset=gcc segmented-stacks=on).
Please note: while LLVM supports segmented stacks, clang seems not to provide the __splitstack_<xyz> functions.
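Putting those steps together, a build might look like this (library names and paths are illustrative):

b2 toolset=gcc segmented-stacks=on --with-context --with-coroutine
g++ -fsplit-stack -DBOOST_USE_SEGMENTED_STACKS app.cpp -lboost_coroutine -lboost_context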

Related

Does C++ have a way to do Cuda style kernel templates, where parameters produce separate compilations?

In CUDA you can specify template parameters that are used to automatically create completely different versions of kernels. The catch is that you can only pass compile-time constant values to the functions, so that the compiler knows ahead of time exactly which versions of the kernel need to be created. For instance, you can have a template parameter int X, then use an if(X==4){this}else{that}, and you'll get two separate functions created, neither of which has the overhead of the 'if' statement.
I've found this to be invaluable in allowing great flexibility and code re-usability without sacrificing performance.
Bonus points if you can point out that branches don't have that much overhead, I never knew that! ;)
Something like this?
#include <iostream>

template <int x>
void function() {
    // Resolved at compile time: the discarded branch is not instantiated,
    // so each function<x> contains no runtime 'if'.
    if constexpr (x == 1) {
        std::cout << "hello\n";
    } else {
        std::cout << "world\n";
    }
}

int main() {
    function<3>();
}
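Note that if constexpr requires C++17. On older compilers, the same zero-overhead dispatch can be sketched with template specialization (my example, under the same "hello"/"world" setup as above):

#include <iostream>

// Primary template: plays the role of the 'else' branch.
template <int x>
struct Print {
    static void run() { std::cout << "world\n"; }
};

// Full specialization chosen at compile time when x == 1; neither
// instantiation contains a runtime branch.
template <>
struct Print<1> {
    static void run() { std::cout << "hello\n"; }
};

template <int x>
void function() { Print<x>::run(); }

int main() {
    function<1>(); // prints "hello"
    function<3>(); // prints "world"
}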

Given a call stack containing a lambda function, how can one determine its source?

Suppose you have some code that pushes onto a queue like this:
template <typename T>
void submitJobToPool(T callable)
{
    someJobQueue.push(callable);
}
...and later on:
void runJobFromPool()
{
    auto job = someJobQueue.pop();
    job();
}
Now imagine that the code crashes due to some error inside of the job() call. If the submitted job was a normal function, the call stack might look something like this:
void myFunction() 0x345678901
void runJobFromPool() 0x234567890
int main(int, char**) 0x123456789
It's easy to see what function crashed here. If it's a functor, it'll be similar but with an operator() in there somewhere (ignoring inlining). However, for a lambda...
void lambda_a7009ccf8810b62b59083b4c1779e569() 0x345678901
void runJobFromPool() 0x234567890
int main(int, char**) 0x123456789
This is not so easy to debug. If there's a debugger attached when it happens, or a core dump available, then that information can be used to derive which lambda crashed, but that information is not always available. As far as I know, disassembly is one of the few ways to determine what crashed from this.
The ideas I've had to make this better are:
Using a tool like addr2line if the platform supports it. This sometimes works, sometimes not.
Wrapping up all lambdas in functors (not ideal, to say the least).
Not using lambdas (again, not ideal).
Using a compiler extension to give the lambda a more meaningful name / add debugging info.
The 4th option sounded promising, so I did some investigation, but couldn't find anything. In case it matters, the compilers I have available are clang++ 5.0 and MSVC 19 (Visual Studio 2015).
My question is, what other tools / techniques are available that can help map a callstack with a lambda function in it to the corresponding source line?
I am afraid it is not possible. You should design your own technique for storing the required information in lambdas. Your option 2 is suitable here. You can look at how Google does it: https://cs.chromium.org/chromium/src/base/task_scheduler/post_task.h
Below is a very raw approach (https://ideone.com/OFCgAq):
#include <iostream>
#include <stack>
#include <functional>

std::stack<std::function<void(void)>> someJobQueue;

template <typename T>
void submitJobToPool(std::string from_here, T callable) {
    // Bind the origin tag into the wrapper so it lives on the stack frame
    // of the call that eventually runs the lambda.
    someJobQueue.push(std::bind([callable](std::string from_here) { callable(); }, from_here));
}

void runJobFromPool() {
    auto job = someJobQueue.top();
    someJobQueue.pop();
    job();
}

int main() {
    submitJobToPool(__func__, [](){ std::cout << "It's me." << std::endl; });
    runJobFromPool();
    return 0;
}
Unfortunately you will not see a perfect call stack. But you can see from_here in a debugger.
void lambda_1a7009ccf8810b62b59083b4c1779e56() 0x345678920
void lambda_a7009ccf8810b62b59083b4c1779e569() 0x345678910 <-- Here `from_here` will be available: "main"
void runJobFromPool() 0x234567890
int main(int, char**) 0x123456780
One technique is to create a struct (ideally via a script) with one static function per lambda, and to call each lambda strictly through this struct. Each function just takes its lambda and executes it.
The struct's method names will then show up in the crash logs and the stack trace.
struct LambdaCrashLogs {
    static void Lambda1(std::function<void(void)> job) {
        job();
    }
};
The main approach will be to write a pre-commit script that generates this struct automatically.
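A hedged sketch of how such a generated struct might be wired in at a submission site (names are illustrative, not from the answer above):

#include <functional>
#include <iostream>

struct LambdaCrashLogs {
    // One named wrapper per call site, emitted by the pre-commit script.
    static void Lambda1(const std::function<void(void)>& job) { job(); }
};

int main() {
    auto task = [] { std::cout << "It's me." << std::endl; };
    // Submit the wrapped call instead of the bare lambda: a crash inside
    // `task` then unwinds through LambdaCrashLogs::Lambda1, whose name
    // appears in the stack trace.
    std::function<void(void)> wrapped = [task] { LambdaCrashLogs::Lambda1(task); };
    wrapped();
}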

What're all the LLVM layers for?

I'm playing with LLVM 3.7 and wanted to use the new ORC stuff. But I've been going at this for a few hours now and still don't get what each layer is for, when to use them, how to compose them, or at the very least the minimum set of things I need in place.
I've been through the Kaleidoscope tutorial, but it doesn't explain what the constituent parts are; it just says put this here and this here (plus the parsing etc. distracts from the core LLVM bits). While that's great to get started, it leaves a lot of gaps. There are lots of docs on various things in LLVM, but there's so much it's actually bordering on overwhelming. Stuff like http://llvm.org/releases/3.7.0/docs/ProgrammersManual.html, but I can't find anything that explains how all the pieces fit together. Even more confusing, there seem to be multiple APIs for doing the same thing -- thinking of MCJIT and the newer ORC API. I saw Lang Hames' post explaining it, but a fair few things seem to have changed since the patch he posted in that link.
So for a specific question, how do all these layers fit together?
When I previously used LLVM I could link to C functions fairly easily. Using the "How to use JIT" example as a base, I tried linking to an extern "C" double doIt function, but ended up with LLVM ERROR: Tried to execute an unknown external function: doIt.
Having a look at this ORC example, it seems I need to configure where it searches for the symbols. But to be honest, while I'm still swinging at this, it's largely guesswork. Here's what I've got:
#include "llvm/ADT/STLExtras.h"
#include "llvm/ExecutionEngine/GenericValue.h"
#include "llvm/ExecutionEngine/Interpreter.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/ManagedStatic.h"
#include "llvm/Support/TargetSelect.h"
#include "llvm/Support/raw_ostream.h"
#include "std.hpp"
using namespace llvm;
int main() {
InitializeNativeTarget();
LLVMContext Context;
// Create some module to put our function into it.
std::unique_ptr<Module> Owner = make_unique<Module>("test", Context);
Module *M = Owner.get();
// Create the add1 function entry and insert this entry into module M. The
// function will have a return type of "int" and take an argument of "int".
// The '0' terminates the list of argument types.
Function *Add1F = cast<Function>(M->getOrInsertFunction("add1", Type::getInt32Ty(Context), Type::getInt32Ty(Context), (Type *) 0));
// Add a basic block to the function. As before, it automatically inserts
// because of the last argument.
BasicBlock *BB = BasicBlock::Create(Context, "EntryBlock", Add1F);
// Create a basic block builder with default parameters. The builder will
// automatically append instructions to the basic block `BB'.
IRBuilder<> builder(BB);
// Get pointers to the constant `1'.
Value *One = builder.getInt32(1);
// Get pointers to the integer argument of the add1 function...
assert(Add1F->arg_begin() != Add1F->arg_end()); // Make sure there's an arg
Argument *ArgX = Add1F->arg_begin(); // Get the arg
ArgX->setName("AnArg"); // Give it a nice symbolic name for fun.
// Create the add instruction, inserting it into the end of BB.
Value *Add = builder.CreateAdd(One, ArgX);
// Create the return instruction and add it to the basic block
builder.CreateRet(Add);
// Now, function add1 is ready.
// Now we're going to create function `foo', which returns an int and takes no
// arguments.
Function *FooF = cast<Function>(M->getOrInsertFunction("foo", Type::getInt32Ty(Context), (Type *) 0));
// Add a basic block to the FooF function.
BB = BasicBlock::Create(Context, "EntryBlock", FooF);
// Tell the basic block builder to attach itself to the new basic block
builder.SetInsertPoint(BB);
// Get pointer to the constant `10'.
Value *Ten = builder.getInt32(10);
// Pass Ten to the call to Add1F
CallInst *Add1CallRes = builder.CreateCall(Add1F, Ten);
Add1CallRes->setTailCall(true);
// Create the return instruction and add it to the basic block.
builder.CreateRet(Add1CallRes);
std::vector<Type *> args;
args.push_back(Type::getDoubleTy(getGlobalContext()));
FunctionType *FT = FunctionType::get(Type::getDoubleTy(getGlobalContext()), args, false);
Function *F = Function::Create(FT, Function::ExternalLinkage, "doIt", Owner.get());
// Now we create the JIT.
ExecutionEngine *EE = EngineBuilder(std::move(Owner)).create();
outs() << "We just constructed this LLVM module:\n\n" << *M;
outs() << "\n\nRunning foo: ";
outs().flush();
// Call the `foo' function with no arguments:
std::vector<GenericValue> noargs;
GenericValue gv = EE->runFunction(FooF, noargs);
auto ax = EE->runFunction(F, noargs);
// Import result of execution:
outs() << "Result: " << gv.IntVal << "\n";
outs() << "Result 2: " << ax.IntVal << "\n";
delete EE;
llvm_shutdown();
return 0;
}
doIt is declared in std.hpp.
Your question is very vague, but maybe I can help a bit. This code sample is a simple JIT built with Orc - it's well commented, so it should be easy to follow.
Put simply, Orc builds on top of the same building blocks used by MCJIT (MC for compiling LLVM modules down to object files, RuntimeDyld for the dynamic linking at runtime), but provides more flexibility with its concept of layers. It can thus support things like "lazy" JIT compilation, which MCJIT doesn't support. This is important for the LLVM community because the "old JIT" that was removed not very long ago supported these things. Orc JIT lets us gain back these advanced JIT capabilities while still building on top of MC and thus not duplicating the code emission logic.
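To make the layering concrete, here is a minimal sketch of how the 3.7-era Orc layers compose (headers under llvm/ExecutionEngine/Orc/; exact signatures shifted between releases, so treat this as illustrative rather than definitive):

#include "llvm/ExecutionEngine/Orc/ObjectLinkingLayer.h"
#include "llvm/ExecutionEngine/Orc/IRCompileLayer.h"
#include "llvm/ExecutionEngine/Orc/CompileUtils.h"

// Bottom layer: takes object files and links them into the running process.
llvm::orc::ObjectLinkingLayer<> ObjectLayer;

// Next layer up: compiles IR modules to objects, then hands them to the layer
// below. TM is assumed to be a llvm::TargetMachine you created earlier (e.g.
// via EngineBuilder().selectTarget()).
llvm::orc::IRCompileLayer<decltype(ObjectLayer)> CompileLayer(
    ObjectLayer, llvm::orc::SimpleCompiler(*TM));

// Optional layers (lazy compile-on-call, IR transforms, ...) stack on top the
// same way: each layer's addModuleSet forwards to the layer beneath it.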
To get better answers, I suggest you ask more specific questions.
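As for the specific doIt failure in the question: one common first step (my suggestion, not something the sample above relies on) is to make the host executable's own symbols visible to the JIT's symbol search before running anything:

#include "llvm/Support/DynamicLibrary.h"

// Passing nullptr "loads" the current process, so symbols it exports
// (such as the extern "C" doIt) become findable during symbol resolution.
// Link with -rdynamic (or --export-dynamic) so doIt is actually exported.
llvm::sys::DynamicLibrary::LoadLibraryPermanently(nullptr);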

Perfect hash function for strings known in advance

I have 4000 strings and I want to create a perfect hash table with these strings. The strings are known in advance, so my first idea was to use a series of if statements:
if (name == "aaa")
    return 1;
else if (name == "bbb")
    return 2;
.
.
.
// 4000th `if' statement
However, this would be very inefficient. Is there a better way?
gperf is a tool that does exactly that:
GNU gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only.
According to the documentation, gperf is used to generate the reserved keyword recogniser for lexers in GNU C, GNU C++, GNU Java, GNU Pascal, GNU Modula 3, and GNU indent.
The way it works is described in GPERF: A Perfect Hash Function Generator by Douglas C. Schmidt.
Better late than never, I believe this now finally answers the OP's question:
Simply use https://github.com/serge-sans-paille/frozen -- a compile-time (constexpr) library of immutable containers for C++ (using "perfect hash" under the hood).
In my tests it performed on par with GNU's famous gperf perfect-hash C code generator.
In your pseudo-code's terms:
#include <frozen/unordered_map.h>
#include <frozen/string.h>

constexpr frozen::unordered_map<frozen::string, int, 2> olaf = {
    {"aaa", 1},
    {"bbb", 2},
    .
    .
    .
    // 4000th element
};

return olaf.at(name);
This will respond in O(1) time, rather than the OP's O(n) (O(n) assuming the compiler wouldn't optimize the if chain, which it might do).
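For a quick self-contained check, the fragment above can be wrapped like this (my sketch; the two keys stand in for the full 4000):

#include <frozen/unordered_map.h>
#include <frozen/string.h>
#include <iostream>

// Built entirely at compile time; lookups go through the generated perfect hash.
constexpr frozen::unordered_map<frozen::string, int, 2> olaf = {
    {"aaa", 1},
    {"bbb", 2},
};

int lookup(frozen::string name) {
    // at() throws std::out_of_range if the key is missing.
    return olaf.at(name);
}

int main() {
    std::cout << lookup("aaa") << "\n"; // prints 1
}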
Since the question is still unanswered and I'm about to add the same functionality to my HFT platform, I'll share my inventory of perfect-hash algorithms in C++. It was harder than I thought to find an open, flexible, and bug-free implementation, so I'm sharing the ones I haven't dropped yet:
The CMPH library, with a collection of papers and such algorithms -- https://git.code.sf.net/p/cmph/git
BBHash, one more implementation from a paper's author -- https://github.com/rizkg/BBHash
Ademakov's -- another implementation from the paper above -- https://github.com/ademakov/PHF
wahern/phf -- I'm currently inspecting this one and trying to solve some allocation bugs it has when dealing with C++ strings on huge key sets -- https://github.com/wahern/phf.git
emphf -- seems unmaintained -- https://github.com/ot/emphf.git
I believe @NPE's answer is very reasonable, and I doubt it is too much for your application, as you seem to imply.
Consider the following example: suppose you have your "engine" logic (that is: your application's functionality) contained in a file called engine.hpp:
// this is engine.hpp
#pragma once
#include <iostream>

void standalone() {
    std::cout << "called standalone" << std::endl;
}

struct Foo {
    static void first() {
        std::cout << "called Foo::first()" << std::endl;
    }
    static void second() {
        std::cout << "called Foo::second()" << std::endl;
    }
};
// other functions...
and suppose you want to dispatch the different functions based on the map:
"standalone" dispatches void standalone()
"first" dispatches Foo::first()
"second" dispatches Foo::second()
# other dispatch rules...
You can do that using the following gperf input file (I called it "lookups.gperf"):
%{
#include "engine.hpp"
struct CommandMap {
const char *name;
void (*dispatch) (void);
};
%}
%ignore-case
%language=C++
%define class-name Commands
%define lookup-function-name Lookup
struct CommandMap
%%
standalone, standalone
first, Foo::first
second, Foo::second
Then you can use gperf to create a lookups.hpp file using a simple command:
gperf -tCG lookups.gperf > lookups.hpp
Once I have that in place, the following main subroutine will dispatch commands based on what I type:
#include <iostream>
#include "engine.hpp"  // this is my application engine
#include "lookups.hpp" // this is gperf's output

int main() {
    std::string command;
    while (std::cin >> command) {
        auto match = Commands::Lookup(command.c_str(), command.size());
        if (match) {
            match->dispatch();
        } else {
            std::cerr << "invalid command" << std::endl;
        }
    }
}
Compile it:
g++ main.cpp -std=c++11
and run it:
$ ./a.out
standalone
called standalone
first
called Foo::first()
Second
called Foo::second()
SECOND
called Foo::second()
first
called Foo::first()
frst
invalid command
Notice that once you have generated lookups.hpp, your application has no dependency whatsoever on gperf.
Disclaimer: I took inspiration for this example from this site.

64bit and 32bit process intercommunication boost::message_queue

Good day folks,
I am currently trying to figure out a way to pass data between a 64-bit process and a 32-bit process. Since it's a real-time application and both are running on the same computer, I thought of using shared memory (shm).
While I was looking for a synchronization mechanism using shm, I came across boost::message_queue. However, it's not working.
My code is basically the following:
Sender part
message_queue::remove("message_queue");
message_queue mq(create_only, "message_queue", 100, sizeof(uint8_t));

for (uint8_t i = 0; i < 100; ++i)
{
    mq.send(&i, sizeof(uint8_t), 0);
}
Receiver part
message_queue mq(open_only, "message_queue");

for (uint8_t i = 0; i < 100; ++i)
{
    uint8_t v;
    size_t rsize;
    unsigned int rpriority;
    mq.receive(&v, sizeof(v), rsize, rpriority);
    std::cout << "v=" << (int) v << ", esize=" << sizeof(uint8_t)
              << ", rsize=" << rsize << ", rpriority=" << rpriority << std::endl;
}
This code works perfectly when the two processes are both 64-bit or both 32-bit, but it fails when they differ.
Looking deeper into the Boost (1.50.0) code, you'll see the following line in message_queue_t::do_receive (boost/interprocess/ipc/message_queue.hpp):
scoped_lock lock(p_hdr->m_mutex);
For some reason, the mutex seems to be locked when working with heterogeneous processes. My wild guess would be that the mutex is read at the wrong offset and therefore its value is corrupted, but I'm not quite sure.
Am I trying to accomplish something that is simply not supported?
Any help or advice will be appreciated.
I think this is about the portability of offset_ptr used in message_queue to point to each message, including the header mutex. 32-/64-bit interoperability should be supported since Boost 1.48.0, as addressed in https://svn.boost.org/trac/boost/ticket/5230.
Following the ticket's suggestion, the following definition has (so far) worked fine for me in lieu of message_queue:
typedef message_queue_t< offset_ptr<void, int32_t, uint64_t> > interop_message_queue;
On Boost 1.50.0 under MSVC this also seems to require a small patch in message_queue.hpp to resolve template ambiguity: casting arguments in calls to ipcdetail::get_rounded_size(...).
I've spent my whole workday figuring out the solution, and I finally made it work. The solution is partially what James provided: I used interop_message_queue on both the 32-bit and the 64-bit processes.
typedef boost::interprocess::message_queue_t< offset_ptr<void, boost::int32_t, boost::uint64_t>> interop_message_queue;
The problem is that with this modification the code wouldn't compile, so I also had to add the following, which I found on the Boost bug report list (#6147: message_queue sample fails to compile in 32-bit). This code has to be placed before the Boost message_queue includes:
namespace boost {
namespace interprocess {
namespace ipcdetail {

// Rounds "orig_size" by excess to round_to bytes
template<class SizeType, class ST2>
inline SizeType get_rounded_size(SizeType orig_size, ST2 round_to) {
    return ((orig_size - 1) / round_to + 1) * round_to;
}

}
}
}
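With both of those pieces in place, each side simply instantiates the interop typedef instead of plain message_queue. For example, the sender from the question becomes (a sketch under the same assumptions as above):

using namespace boost::interprocess;

// Same sender as before, but built on the 32-/64-bit-safe typedef so the
// internal offset_ptr layout is identical in both processes.
interop_message_queue::remove("message_queue");
interop_message_queue mq(create_only, "message_queue", 100, sizeof(uint8_t));

for (uint8_t i = 0; i < 100; ++i)
{
    mq.send(&i, sizeof(uint8_t), 0);
}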