I need to speed up data processing in R by moving it to C++. I already have the C++ code; it reads from a txt file the data that R should pass. Since I need R for my analysis, I want to integrate my C++ code into R.
What the C++ code needs is a (large) dataframe (for which I use nested std::vectors) and a set of parameters, so I am thinking about passing the parameters through the .Call interface and then dealing with the data in the following way:
R: write data in txt file with a given encoding
C++: read from txt, do what I need to do and write the result in a txt (which is still a dataset -> std::vector)
R: read the result from txt
This would save me from rewriting part of the code. The possible bottleneck is the reading/writing; do you believe it is a real problem?
Otherwise, as an alternative, is it reasonable to copy all my data into C++ structures through the .Call interface?
Thank you.
You could start with the very simple DataFrame example in the RcppExamples package:
#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
List DataFrameExample(const DataFrame& DF) {
    // access each column by name
    IntegerVector a = DF["a"];
    CharacterVector b = DF["b"];
    DateVector c = DF["c"];

    // do something
    a[2] = 42;
    b[1] = "foo";
    c[0] = c[0] + 7;    // move up a week

    // create a new data frame
    DataFrame NDF = DataFrame::create(Named("a") = a,
                                      Named("b") = b,
                                      Named("c") = c);

    // and return old and new in a list
    return List::create(Named("origDataFrame") = DF,
                        Named("newDataFrame") = NDF);
}
You can assign vectors (from either Rcpp or the STL) and matrices (again, either from Rcpp, or if you prefer nested STL vectors). And then you also have Eigen and Armadillo via RcppEigen and RcppArmadillo. And on and on -- there are over 1350 packages on CRAN you could study. And a large set of ready-to-run examples are at the Rcpp Gallery.
Reading and writing large datasets back and forth is not an optimal way of passing data between R and your C++ code. Depending on how long your C++ code takes to execute, this may or may not be the worst bottleneck, but the approach should be avoided.
You can look at the following solution for passing a data.frame (or data.table) object:
Passing a `data.table` to c++ functions using `Rcpp` and/or `RcppArmadillo`
As for passing additional parameters, the solution will depend on what kind of parameters we are talking about. If those are just numeric values, then you can pass them directly to C++ (see High performance functions with Rcpp: http://adv-r.had.co.nz/Rcpp.html).
I have an algorithm implemented in MATLAB. I want to replace different built-in operations with my own C++ implementations. I do not want to use MEX because of its extra overhead. Is there any way to provide variables as input arguments to my C++ executable using the "system" command? For example, for a multiplier implemented in C++, the following works.
result = system('multiplier.exe 10 50')
The result is 500. But the following does not work:
a = 10;
b = 50;
result = system('multiplier.exe a b');
The result is always 0 in this case.
I have tried the setenv and getenv functions, but the result is still 0.
Any help?
As hinted by user4581301 in a comment, your code is passing a and b literally as the strings "a" and "b", not translating them into 10 and 50 as you expect. You need to insert the values of a and b after converting them to strings instead, i.e.
result = system(['multiplier.exe ', num2str(a), ' ', num2str(b)]);
MATLAB is passing them as strings.
Writing your values to a file and reading from that file inside the said program might be a workaround. But if your inputs/outputs are long and you want them to have the appropriate data types, then you might want to filter the input and output streams (stdin, stdout) between MATLAB and C++ to get the appropriate behavior.
I am working on a deep learning task in Lua and I need to import an image stream into Lua from C++. The C++ script is reading data from a shared memory and I have the pointer to the images in C++. I would like to get this pointer in Lua and give the image stream to my Neural Network.
Since Lua does not use pointers, as far as my research goes the way to do this is to wrap the pointer into userdata and write accessor methods, as explained in this User Data With Pointer Example.
My problem is that I need to feed the neural network with raw pixel data and so far the material available online suggests that processing on data should be done in C++ and there is no way to get that object in Lua directly.
What should be the correct approach here? How can I get object values in Lua from pointer in C++?
Well, you have to make the data accessible to Lua in some way.
I can think of four general approaches:

1. converting the data
   1a. writing C(++) code that traverses the data to create a Lua data structure made of tables & primitive values (booleans, numbers, strings)
   1b. converting the data to an interchange format (JSON, CBOR, … or in the case of image data: PNM / PAM) for which you have code on both sides or can trivially write it
   (I won't say any more on these options here.)
2. passing it in as userdata and writing accessor functions, so you can write the conversion code in Lua instead of C(++)
3. (only for pointer-free data, and images usually are that) passing the raw data area to Lua as a string
4. (LuaJIT only) using ffi.cdef to declare C structs that match the in-memory format for free access & conversion (int, float, doesn't matter – can also work with pointers)

(2) is basically just (1a) with most of the code shifted to the Lua side.
(3) is the same with free accessor code (string.byte( data, i[, j] ) and a bunch of others, newly added in 5.3: string.unpack( fmt, data[, startpos] )).
Before deciding on one of those approaches, look at what format the other side wants.
Maybe your data is already in that format, in which case you would only have to route through the pointer in some way. (Maybe there's a function accepting a Lua table, but maybe there's also one that accepts a string or pointer to raw image data.) If this applies, use (light) userdata or strings for communication. (This will probably be the fastest, and if this way of passing the data doesn't exist yet consider asking upstream whether they can add it.)
Failing that, you'll have to actually touch the data. The string version is potentially the easiest, but also likely to be the slowest. (If the data is one contiguous block of bytes, you can simply lua_pushlstring( L, ptr, len_in_bytes ) to get everything into Lua, but that creates a copy of the "string" internal to the Lua state which might be relatively slow.) After that, you can use the standard string handling functionality of Lua.
If you're using LuaJIT, it's probably a very good idea to use the ffi library. A very brief outline of what you may need (using a linked list as example):
ffi = require "ffi"

-- sample structure, yours will differ
ffi.cdef [[
  typedef struct Node_s { struct Node_s *next; int value; } Node;
]]

-- in your case you will get a pointer from C, which you can then
--   ffi.cast( "Node*", ptr )
-- after that, you can access the fields (ptr.next, ptr.value etc.)
-- we'll just use some dummy sample data here:
do
  local p = ffi.new( "Node" )
  local q = ffi.new( "Node" )
  p.next = q
  p.value, p.next.value = 23, 42
  ptr = tonumber( tostring( p ):match( ": (0x%x*)" ) )
end

data = ffi.cast( "Node*", ptr )
print( "first value:", data.value )
--> first value: 23
print( "second value:", data.next.value )
--> second value: 42
print( "third value:", data.next.next.value )
--> Segmentation fault [so be just as careful as you're in C(++)!]
Lastly, if you're on vanilla Lua, you may want to use (full) userdata and manually written accessors to avoid the speed loss from data copying. (The full userdata is required so you can set a metatable to get the accessors, but it can simply be a pointer to your externally allocated data.) The book "Programming in Lua" has good information on that (first edition is online, direct link to the relevant chapter.)
Just for learning purposes, I'm writing a C++ program which has a GUI interface and contains a library of assorted functions, solvers, etc. A user defines their own set of instructions in a text field; this input would then be parsed into a form that the underlying library can understand.
How would this work? Would I parse the input into some form of machine-level code? Can somebody suggest a manner in which this could be organised, from a C++ perspective or otherwise? It seems inefficient for the underlying library itself to provide a set of parsing instructions.
So for example, if the user input this
s1 = 2;
s2 = 7;
s3 = 9;
x1' = s1 * x2 + s2;
x2' = s3 * x1'
I would generate something like this
constexpr double s1 = 2;
constexpr double s2 = 7;
constexpr double s3 = 9;
void ode_system(std::vector<double>& x_, std::vector<double>& y_)
{
    y_[0] = s1 * x_[1] + s2;
    y_[1] = s3 * x_[0];
}
How would this be compiled and linked with the libraries in a standalone GUI, if this is the correct way?
If not, what is the best way to achieve what I am doing?
I ask this to save myself from undeniable hell if I go down the wrong path, and also as a reference for others.
As an example, programs like MATLAB (albeit with a Java interface) will parse your script into some object so that they can use their built-in library functions.
As you do not say exactly what the syntax of the script file is, it is hard to give precise advice. From what I understand of your question, I think you should:
define the precise syntax of the input script file
use a parser generator such as lex + yacc, or its variant flex + bison, to build routines that parse the file and convert it to — what exactly, I cannot say before you say it first (*)
define the precise syntax of input GUI and the way you want to use the elements of the script
if it is simple enough parse it directly or use again lex+yacc to build a parser
glue all that together
(*): you do not give enough elements to allow a precise answer. Depending on the complexity of the expressions you want to process, you can simply have a set of routines that take their parameters from the input script: that is more or less what you called convert it to, but I would not go down to machine-level code.
On the other hand, if it is really complex or highly versatile, you can fully embed an interpreted language such as Python and have it directly execute the scripts with data coming from the GUI parser.
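As a small illustration of the "set of routines that take their parameters from the input script" idea, here is a sketch (parseConstants is a made-up helper, not from any library mentioned above) that reads simple name = value; assignments into a symbol table. The question's full grammar, with x1' = ... equations, would need a real parser as suggested above.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Parse "name = value;" statements into a symbol table the solver
// library could read its constants from. Anything that does not match
// the "name = number" shape is silently skipped in this sketch.
std::map<std::string, double> parseConstants(const std::string& script)
{
    std::map<std::string, double> symbols;
    std::istringstream in(script);
    std::string statement;
    while (std::getline(in, statement, ';')) {   // statements end with ';'
        std::istringstream stmt(statement);
        std::string name, eq;
        double value;
        if (stmt >> name >> eq >> value && eq == "=")
            symbols[name] = value;               // e.g. s1 -> 2
    }
    return symbols;
}
```

Given the question's input, parseConstants("s1 = 2; s2 = 7; s3 = 9;") yields a map with s1=2, s2=7, s3=9, from which the constexpr constants of the generated ode_system could be emitted.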
I know SO is not rent-a-coder, but I have a really simple Python example that I need help translating into C++.
grey_image_as_array = numpy.asarray( cv.GetMat( grey_image ) )
non_black_coords_array = numpy.where( grey_image_as_array > 3 )
# Convert from numpy.where()'s two separate lists to one list of (x, y) tuples:
non_black_coords_array = zip( non_black_coords_array[1], non_black_coords_array[0] )
The first one is rather simple, I guess: a linearly indexable array is created from the bytes returned by cv.GetMat, right?
What would be the equivalent of Python's where and especially the zip function?
I don't know about OpenCV, so I can't tell you what cv.GetMat() does. Apparently, it returns something that can be used as or converted to a two-dimensional array. The C or C++ interface to OpenCV that you are using will probably have a similarly named function.
The following lines create an array of index pairs for the entries in grey_image_as_array that are bigger than 3. Each entry in non_black_coords_array is a pair of zero-based x-y coordinates into grey_image_as_array. Given such a coordinate pair x, y, you can access the corresponding entry in a two-dimensional C++ array grey_image_as_array with grey_image_as_array[y][x].
The Python code has to avoid explicit loops over the image to achieve good performance, so it needs to make do with the vectorised functions NumPy offers. The expression grey_image_as_array > 3 is a vectorised comparison and results in a Boolean array of the same shape as grey_image_as_array. Next, numpy.where() extracts the indices of the True entries in this Boolean array, but the result is not in the format described above, so we need zip() to restructure it.
In C++, there's no need to avoid explicit loops, and an equivalent of numpy.where() would be rather pointless -- you just write the loops and store the result in the format of your choice.
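Following that reasoning, the whole numpy.where()/zip() pipeline collapses to one explicit double loop in C++. A sketch (nonBlackCoords is an illustrative name; the image is assumed to be a vector of rows):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Equivalent of: zip(numpy.where(img > 3)[1], numpy.where(img > 3)[0])
// i.e. collect the (x, y) coordinates of all entries greater than the
// threshold, in row-major scan order.
std::vector<std::pair<int, int>>
nonBlackCoords(const std::vector<std::vector<int>>& img, int threshold = 3)
{
    std::vector<std::pair<int, int>> coords;
    for (int y = 0; y < static_cast<int>(img.size()); ++y)
        for (int x = 0; x < static_cast<int>(img[y].size()); ++x)
            if (img[y][x] > threshold)
                coords.emplace_back(x, y);   // (x, y), matching the zip order
    return coords;
}
```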
I'm wondering if there is a library like Boost Format, but which supports named parameters rather than positional ones. This is a common idiom in e.g. Python, where you have a context to format strings with that may or may not use all available arguments, e.g.
mouse_state = {}
mouse_state['button'] = 0
mouse_state['x'] = 50
mouse_state['y'] = 30
#...
"You clicked %(button)s at %(x)d,%(y)d." % mouse_state
"Targeting %(x)d, %(y)d." % mouse_state
Are there any libraries that offer the functionality of those last two lines? I would expect it to offer an API something like:
PrintFMap(string format, map<string, string> args);
In Googling I have found many libraries offering variations of positional parameters, but none that support named ones. Ideally the library has few dependencies so I can drop it easily into my code. C++ won't be quite as idiomatic for collecting named arguments, but probably someone out there has thought more about it than me.
Performance is important, in particular I'd like to keep memory allocations down (always tricky in C++), since this may be run on devices without virtual memory. But having even a slow one to start from will probably be faster than writing it from scratch myself.
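For reference, a minimal sketch of what such a PrintFMap-style function could look like (FormatMap is a hypothetical name; only the %(name)s conversion is handled, and unknown placeholders are left untouched):

```cpp
#include <cassert>
#include <map>
#include <string>

// Substitute "%(name)s" placeholders with values from the map. This is
// an illustrative sketch only: other conversion flags (%(x)d etc.) and
// allocation tuning are left out.
std::string FormatMap(const std::string& format,
                      const std::map<std::string, std::string>& args)
{
    std::string out;
    out.reserve(format.size());              // one upfront size guess
    std::size_t i = 0;
    while (i < format.size()) {
        if (format.compare(i, 2, "%(") == 0) {
            std::size_t close = format.find(")s", i + 2);
            if (close != std::string::npos) {
                auto it = args.find(format.substr(i + 2, close - i - 2));
                if (it != args.end()) {
                    out += it->second;       // substitute the value
                    i = close + 2;           // skip past ")s"
                    continue;
                }
            }
        }
        out += format[i++];                  // plain character, copy through
    }
    return out;
}
```

A map of strings keeps the interface simple at the cost of pre-converting numeric values; a variant-based argument type would avoid that, at the cost of more code.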
The fmt library supports named arguments:
print("You clicked {button} at {x},{y}.",
arg("button", "b1"), arg("x", 50), arg("y", 30));
And as a syntactic sugar you can even (ab)use user-defined literals to pass arguments:
print("You clicked {button} at {x},{y}.",
"button"_a="b1", "x"_a=50, "y"_a=30);
For brevity the namespace fmt is omitted in the above examples.
Disclaimer: I'm the author of this library.
I've always been critical of C++ I/O (especially formatting) because in my opinion it is a step backward with respect to C. Formats need to be dynamic, and it makes perfect sense, for example, to load them from an external resource such as a file or a parameter.
However, I had never tried to actually implement an alternative before, and your question made me make an attempt, investing some weekend hours on this idea.
Sure, the problem was more complex than I thought (for example, just the integer-formatting routine is 200+ lines), but I think that this approach (dynamic format strings) is more usable.
You can download my experiment from this link (it's just a .h file) and a test program from this link (test is probably not the correct term, I used it just to see if I was able to compile).
The following is an example
#include "format.h"
#include <iostream>

using format::FormatString;
using format::FormatDict;

int main()
{
    std::cout << FormatString("The answer is %{x}") % FormatDict()("x", 42);
    return 0;
}
It differs from the boost.format approach because it uses named parameters and because the format string and format dictionary are meant to be built separately (and, for example, passed around). I also think that formatting options should be part of the string (as with printf) and not in the code.
FormatDict uses a trick for keeping the syntax reasonable:
FormatDict fd;
fd("x", 12)
("y", 3.141592654)
("z", "A string");
FormatString is instead just parsed from a const std::string& (I decided to preparse format strings; a slower but probably acceptable approach would be just passing the string and reparsing it each time).
The formatting can be extended for user defined types by specializing a conversion function template; for example
struct P2d
{
    int x, y;
    P2d(int x, int y)
        : x(x), y(y)
    {
    }
};

namespace format {
    template<>
    std::string toString<P2d>(const P2d& p, const std::string& parms)
    {
        return FormatString("P2d(%{x}; %{y})") % FormatDict()
            ("x", p.x)
            ("y", p.y);
    }
}
after that a P2d instance can be simply placed in a formatting dictionary.
Also it's possible to pass parameters to a formatting function by placing them between % and {.
For now I only implemented an integer formatting specialization that supports
Fixed size with left/right/center alignment
Custom filling char
Generic base (2-36), lower or uppercase
Digit separator (with both custom char and count)
Overflow char
Sign display
I've also added some shortcuts for common cases, for example
"%08x{hexdata}"
is a hex number with 8 digits, padded with '0's.
"%026/2,8:{bindata}"
is a 24-bit binary number (as required by "/2") with digit separator ":" every 8 bits (as required by ",8:").
Note that the code is just an idea; for now I have simply prevented copies, when it is probably reasonable to allow storing both format strings and dictionaries (for dictionaries, however, it is important to provide a way to avoid copying an object just because it needs to be added to a FormatDict; while IMO this is possible, it also raises non-trivial questions about lifetimes).
UPDATE
I've made a few changes to the initial approach:
Format strings can now be copied
Formatting for custom types is done using template classes instead of functions (this allows partial specialization)
I've added a formatter for sequences (two iterators). Syntax is still crude.
I've created a github project for it, with boost licensing.
The answer appears to be, no, there is not a C++ library that does this, and C++ programmers apparently do not even see the need for one, based on the comments I have received. I will have to write my own yet again.
Well I'll add my own answer as well, not that I know (or have coded) such a library, but to answer to the "keep the memory allocation down" bit.
As always I can envision some kind of speed / memory trade-off.
On the one hand, you can parse "Just In Time":
class Formater:
    def __init__(self, format): self._string = format

    def compute(self):
        for k, v in context:
            while self.__contains(k):
                left, variable, right = self.__extract(k)
                self._string = left + self.__replace(variable, v) + right
This way you don't keep a "parsed" structure at hand, and hopefully most of the time you'll just insert the new data in place (unlike Python, C++ strings are not immutable).
However it's far from being efficient...
On the other hand, you can build a fully constructed tree representing the parsed format. You will have several classes like: Constant, String, Integer, Real, etc... and probably some subclasses / decorators as well for the formatting itself.
I think however than the most efficient approach would be to have some kind of a mix of the two.
explode the format string into a list of Constant, Variable
index the variables in another structure (a hash table with open-addressing would do nicely, or something akin to Loki::AssocVector).
There you are: you're done with only 2 dynamically allocated arrays (basically). If you want to allow a same key to be repeated multiple times, simply use a std::vector<size_t> as a value of the index: good implementations should not allocate any memory dynamically for small sized vectors (VC++ 2010 doesn't for less than 16 bytes worth of data).
When evaluating the context itself, look up the instances. You then parse the formatter "just in time", check it against the current type of the value with which to replace it, and process the format.
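The "explode + index" step can be sketched as follows (Token, ParsedFormat and explode are illustrative names, assuming Python-style %(name)s placeholders as in the question):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// A parsed format is a flat token list (Constant / Variable) plus an
// index mapping each variable name to the token positions where it
// occurs, so repeated keys are found without rescanning the string.
struct Token {
    bool isVariable;
    std::string text;    // literal text, or the variable name
};

struct ParsedFormat {
    std::vector<Token> tokens;
    std::unordered_map<std::string, std::vector<std::size_t>> index;
};

ParsedFormat explode(const std::string& fmt)
{
    ParsedFormat pf;
    std::size_t pos = 0;
    while (pos < fmt.size()) {
        std::size_t open = fmt.find("%(", pos);
        std::size_t close = (open == std::string::npos)
                                ? std::string::npos
                                : fmt.find(")s", open + 2);
        if (open == std::string::npos || close == std::string::npos) {
            pf.tokens.push_back({false, fmt.substr(pos)});  // trailing literal
            break;
        }
        if (open > pos)
            pf.tokens.push_back({false, fmt.substr(pos, open - pos)});
        std::string name = fmt.substr(open + 2, close - open - 2);
        pf.index[name].push_back(pf.tokens.size());
        pf.tokens.push_back({true, name});
        pos = close + 2;                                    // skip ")s"
    }
    return pf;
}
```

Rendering then walks the token list once, substituting variable tokens from the context; the index alone answers "which positions need the value of x" without touching the literals.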
Pros and cons:
- Just In Time: you scan the string again and again
- One Parse: requires a lot of dedicated classes, possibly many allocations, but the format is validated on input. Like Boost it may be reused.
- Mix: more efficient, especially if you don't replace some values (allow some kind of "null" value), but delaying the parsing of the format delays the reporting of errors.
Personally I would go for the One Parse scheme, trying to keep the allocations down using boost::variant and the Strategy Pattern as much I could.
Given that Python itself is written in C and that formatting is such a commonly used feature, you might be able (ignoring copyright issues) to rip the relevant code from the Python interpreter and port it to use STL maps rather than Python's native dicts.
I've written a library for this purpose; check it out on GitHub.
Contributions are welcome.