Rcpp: Recommended code structure when using data frames with Rcpp (inline)

[I had this sketched out as a comment elsewhere but decided to create a proper question...]
What is currently considered "best practice" in terms of code structuring when using data frames in Rcpp? The ease with which one can "beam over" an input data frame from R to the C++ code is remarkable, but if the data frame has n columns, is the current thinking that this data should be split up into n separate (C++) vectors before being used?
The response to my previous question on making use of a string (character vector) column in a data frame suggests to me that yes, this is the right thing to do. In particular, there doesn't seem to be support for a notation such as df.name[i] to refer to the data frame information directly (as one might have in a C structure), unless I'm mistaken.
However, this leads us into a situation where subsetting down the data is much more cumbersome - instead of being able to subset a data frame in one line, each variable must be dealt with separately. So, is the thinking that subsetting in Rcpp is best done implicitly, via boolean vectors, say?
To summarise: I wanted to check my current understanding that although a data frame can be beamed over to the C++ code, there is no way to refer directly to the individual elements of its columns in a "df.name[i]" fashion, and no simple method of generating a sub-dataframe of the input df by selecting rows that satisfy simple criteria (e.g. df.date being in a given range).

Because data frames are in fact internally represented as a list of vectors, access by individual vectors really is the best you can do. There is simply no way to subset by row at the C or C++ level.
There was a good discussion about that on r-devel a few weeks ago in the context of a transpose of a data.frame (which you cannot do 'cheaply' for the same reason).
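By way of illustration, here is a minimal sketch of the column-by-column approach; the column names "date" and "name", and treating the date as plain numeric, are assumptions made for the example:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
DataFrame subsetByDate(DataFrame df, double lo, double hi) {
    // A data frame arrives as a list of column vectors; pull each one out.
    NumericVector date = df["date"];       // assumed column name
    CharacterVector name = df["name"];     // assumed column name

    // Row subsetting has to be done by hand, one column at a time.
    std::vector<double> outDate;
    std::vector<std::string> outName;
    int n = date.size();
    for (int i = 0; i < n; ++i) {
        if (date[i] >= lo && date[i] <= hi) {
            outDate.push_back(date[i]);
            outName.push_back(as<std::string>(name[i]));
        }
    }
    return DataFrame::create(Named("date") = outDate,
                             Named("name") = outName);
}

Each column is extracted as its own vector, the row filter is applied to every column separately, and a new data frame is built from the survivors; this is exactly the per-variable bookkeeping the question describes.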

Related

What is the scope of result rows in PDI Kettle?

Working with result rows in Kettle is the only way to pass lists internally in the program. But how does this work exactly? This topic is not well documented, and there are a lot of questions.
For example, a job containing 2 transformations can have result rows sent from the first to the second. But what if a third transformation gets the result rows? What is the scope? Can you pass result rows to a sub-job as well? Can you clear the result rows based on logic inside a transformation?
Working with lists and arrays is useful and necessary in programming, but confusing in PDI Kettle.
I agree that working with result rows may be confusing, but you can be confident: it works.
Yes, you can pass them to a sub-job, and through a series of sub-jobs (define the scope as "valid in the Java machine" for a first test).
And no, there is no way to clear the result rows in a transformation (and certainly not based on a formula); that would mean a terrible maintenance overhead.
Kettle is not an imperative language; it belongs more to the data-flow family. That means it is closer to the way you think when developing an ETL, and much, much more performant. The drawback is that lists and arrays have no meaning: there is only the flow of data.
And that is what a result set is: a flow of data, like the result set of an SQL query. The next job has to open it, pass each row to the transformation, and close it after the last row.

What's the most efficient way to store a subset of column indices of a big matrix in C++?

I am working with a very big matrix X (say, 1,000-by-1,000,000). My algorithm goes as follows:
1) Scan the columns of X one by one and, based on some filtering rules, identify only a subset of columns that are needed. Denote the subset of column indices by S. Its size depends on the filter, so it is unknown before the computation and will change if the filtering rules are different.
2) Loop over S and do some computation with a column x_i if i is in S. This step needs to be parallelized with OpenMP.
3) Repeat steps 1 and 2 100 times with changed filtering rules, defined by a parameter.
I am wondering what the best way is to implement this procedure in C++. Here are two ways I can think of:
(a) Use a 0-1 array (of length 1,000,000) to indicate the needed columns for step 1 above; then in step 2 loop over 1 to 1,000,000, check the indicator with an if, and do the computation when the indicator is 1 for that column;
(b) Use a std::vector for S and push_back each column index identified as needed; then loop only over S, each time extracting a column index from S and doing the computation. (I considered this approach, but I've read that push_back is expensive, even for just storing integers.)
Since my algorithm is very time-consuming, I assume that a small time saving in this basic step would mean a lot overall. So my question is: should I try (a), (b), or some even better way for performance (and for working with OpenMP)?
Any suggestions/comments for achieving better speedup are very appreciated. Thank you very much!
To me, it seems that step 1 really does not matter much: at the end of the day, you're going to wind up with a set of columns, however it is represented.
What's really going to matter is what happens when you unleash the parallelized step 2.
An array of ones and zeros, however large, should be fairly simple to work with in parallel, while a more 'advanced' data structure might well, in this case, just get in the way.
One thousand megabits, these days? Sure, no problem (and if it is a problem, a simple array of bit-sets will do). However many simultaneously executing entities there are, they should be able to navigate such a data structure in parallel with a minimum of conflict. Therefore, to my gut, big bit-sets win.
I think you will find std::vector easier to use. Regarding push_back, the cost arises when the vector reallocates (and possibly copies) its data. To avoid that (if it matters), reserve a capacity of 1,000,000 up front with vector::reserve. Your vector is then 8 MB, insignificant compared to your problem size. It's only one order of magnitude bigger than a bitmap would be, and a lot simpler to deal with: if we call your vector S, then access to the ith interesting column is just x[S[i]].
(Based on my gut feeling) I'd probably go for pushing back into a vector, but the answer is quite simple: Measure both methods (they are both trivial to implement). Most likely you won't see a noticeable difference.
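For concreteness, here is a sketch of option (b) with reserve() addressing the reallocation concern. The filter rule, the function names, and the column-major layout of X are all placeholder assumptions:

#include <cstddef>
#include <vector>

// Hypothetical filter standing in for the real rule: keep a column
// if its first entry exceeds the parameter.
static bool keepColumn(const double* col, std::size_t nrow, double param) {
    return col[0] > param;
}

void processColumns(const double* X, std::size_t nrow, std::size_t ncol,
                    double param) {
    // Step 1: collect the indices of the needed columns (option (b)).
    std::vector<std::size_t> S;
    S.reserve(ncol);              // one allocation; push_back never reallocates
    for (std::size_t j = 0; j < ncol; ++j)
        if (keepColumn(X + j * nrow, nrow, param))
            S.push_back(j);

    // Step 2: loop only over the selected columns, in parallel.
    // Compile with -fopenmp (or your compiler's equivalent).
    #pragma omp parallel for
    for (long k = 0; k < (long)S.size(); ++k) {
        const double* col = X + S[k] * nrow;   // column access is just X[S[k]]
        // ... computation on this column ...
        (void)col;
    }
}

Because S is a plain contiguous array of indices, the OpenMP loop over it divides cleanly among threads, which is the property both answers care about.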

Read/Sort a large .CSV File

So conceptually I'm reading in a file with ~2 million lines of data. I'm looking to sort, store and apply other functions to the data later.
I've been told this is referred to as "buckets", but I'm unclear on whether this is something pre-defined or a user-defined data type. So I'm curious whether a linked list, an array, or some other combination would be advisable?
Do I need to worry about the size of the file? Will most machines be able to handle all of that in memory at once, or will I need to partition the data first (i.e. divide it into buckets, store each bucket in its own file, then process them separately, etc.)?
If #2 is required, does C++ have the functionality to save multiple files per execution? Meaning: a) create bucket1 file; b) populate bucket1 file; c) close bucket1 file; d) create bucket2 file; ...
OK, so I gather from your post that you are writing this in C++. However, the details are a bit sparse apart from the sorting requirement. But what are you sorting on? Are all fields interpreted as text? Are some numbers? Are there multiple keys?
If you don't absolutely need to write this in C++, and you are on Linux, just invoke /bin/sort to do the sorting. This may seem like a cop-out, but commercial software like Talend even resorts to that.
But if you must write new code in C++, these are my recommendations:
1) Is the CSV file escaped? In other words, do embedded quotes and delimiters need special treatment? You have to figure this out first.
2) Check this out: http://mybyteofcode.blogspot.com/2010/02/parse-csv-file-with-boost-tokenizer-in.html
3) A simple representation of the scanned input is vector<vector<string> >, but it is unwieldy. Instead, wrap a class around vector<string>, make a vector of pointers to those classes, one per line of input, and sort the pointers instead (see the sketch after this list).
4) You should be able to sort ~2M "medium" rows in memory these days. Just use std::sort. But for full generality, you will need to consider, what if it doesn't fit into memory? The most common answer to this is to sort chunks at a time, write the results to disk, and then merge it all using a priority-queue or similar structure.
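To illustrate point 3 (using std::sort from point 4), here is a sketch; the Row class and the choice of the first field as the sort key are assumptions:

#include <algorithm>
#include <memory>
#include <string>
#include <vector>

// One parsed CSV line, wrapped in a class as suggested in 3).
struct Row {
    std::vector<std::string> fields;
};

// Sort pointers to rows rather than the rows themselves, so only
// pointers move during the sort, not whole lines of text.
void sortRows(std::vector<std::unique_ptr<Row> >& rows) {
    std::sort(rows.begin(), rows.end(),
              [](const std::unique_ptr<Row>& a,
                 const std::unique_ptr<Row>& b) {
                  return a->fields[0] < b->fields[0];
              });
}

If the data outgrows memory, the same comparator can drive the chunk-sort phase of the external merge sort described in point 4.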

C++ efficiently extracting subsets from vector of user defined structure

Let me preface this with the statement that most of my background has been with functional programming languages so I'm fairly novice with C++.
Anyhow, the problem I'm working on is that I'm parsing a csv file with multiple variable types. A sample line from the data looks as such:
"2011-04-14 16:00:00, X, 1314.52, P, 812.1, 812"
"2011-04-14 16:01:00, X, 1316.32, P, 813.2, 813.1"
"2011-04-14 16:02:00, X, 1315.23, C, 811.2, 811.1"
So what I've done is define a struct which stores each line. Each of these is then stored in a std::vector<mystruct>. Now say I want to subset this vector by column 4 into two vectors, where every element with a P in it goes in one and every element with a C in the other.
Now the example I gave is fairly simplified, but the actual problem involves subsetting multiple times.
My initial naive implementation was to iterate through the entire vector, creating individual subsets defined by new vectors, then subsetting those newly created vectors. Something a bit more memory-efficient might be to create an index, which would then be shrunk down.
Now my question is: is there a more efficient (in terms of speed/memory usage) way to do this within the std::vector<mystruct> framework, or is there some better data structure to handle this type of thing?
Thanks!
EDIT:
Basically the output I'd like is the first two lines and the last line, separately. Another thing worth noting is that the dataset is typically not ordered like the example, so the Cs and Ps are not grouped together.
I've used std::partition for this. It's not part of Boost, though; it's in the standard <algorithm> header.
If you want a data structure that allows you to move elements between different instances cheaply, the data structure you are looking for is std::list<> and its splice() family of functions.
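A minimal splice() sketch, assuming mystruct has a char field named type holding the fourth column:

#include <iterator>
#include <list>

struct mystruct { char type; /* date, prices, ... */ };

// Move every 'P' record from src into dst without copying the structs;
// splice() just relinks list nodes, so each move is O(1).
void movePRecords(std::list<mystruct>& src, std::list<mystruct>& dst) {
    for (std::list<mystruct>::iterator it = src.begin(); it != src.end(); ) {
        std::list<mystruct>::iterator next = std::next(it);  // save before splicing
        if (it->type == 'P')
            dst.splice(dst.end(), src, it);
        it = next;
    }
}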
I understand you don't have trouble doing this per se, but you seem to be concerned about memory usage and performance.
Depending on the size of your struct and the number of entries in the csv file, it may be advisable to use a smart pointer, if you don't need to modify the partitioned data, so that the mystruct objects are not copied:
typedef std::vector<boost::shared_ptr<mystruct> > table_t;
table_t cvs_data;
If you use std::partition (as another poster suggested) you need to define a predicate that takes the indirection of the shared_ptr into account.
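A sketch of such a predicate with std::partition, using std::shared_ptr (the standard equivalent of boost::shared_ptr) and the same assumed type field:

#include <algorithm>
#include <memory>
#include <vector>

struct mystruct { char type; /* date, prices, ... */ };
typedef std::vector<std::shared_ptr<mystruct> > table_t;

// Reorder cvs_data so all 'P' rows come first; the returned iterator
// marks where the 'C' rows begin. Only the cheap shared_ptrs are
// shuffled; the predicate dereferences them, as noted above.
table_t::iterator splitPC(table_t& cvs_data) {
    return std::partition(cvs_data.begin(), cvs_data.end(),
                          [](const std::shared_ptr<mystruct>& p) {
                              return p->type == 'P';
                          });
}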

What's a better way to store text for a word processor?

The usual way is to store the characters in a string, but while writing a text the user often deletes or adds characters in the middle of it. Perhaps it is better to use a std::list<char> to contain the characters; then adding characters in the middle of the list is not a costly operation.
The following paper summarizes the data structures used in word processors: http://www.cs.unm.edu/~crowley/papers/sds.pdf
Data Structures for Text Sequences.
Charles Crowley, University of New Mexico, 1998
The data structure used to maintain the sequence of characters is an important part of a text editor. This paper investigates and evaluates the range of possible data structures for text sequences. The ADT interface to the text sequence component of a text editor is examined. Six common sequence data structures (array, gap, list, line pointers, fixed size buffers and piece tables) are examined and then a general model of sequence data structures that encompasses all six structures is presented. The piece table method is explained in detail and its advantages are presented. The design space of sequence data structures is examined and several variations on the ones listed above are presented. These sequence data structures are compared experimentally and evaluated based on a number of criteria. The experimental comparison is done by implementing each data structure in an editing simulator and testing it using a synthetic load of many thousands of edits. We also report on experiments on the sensitivity of the results to variations in the parameters used to generate the synthetic editing load.
First, word processors do much more than string manipulation; you will need a rich-text data structure. If you need pagination you will also need some metadata, such as page setup. Do some research on Word and you will have your answer.
For the rich-text part, your data structure has to store two things: characters and attributes. In other words, you need some kind of markup language. HTML/DOM is one choice, but most of the time it is overkill because of its complexity.
There are many data structures that can handle the character part: the rope, the gap buffer, and the piece table. But none of them provides attribute support directly; you have to build that yourself.
AbiWord used a list-based piece table before, but now uses a tree-based piece table. Go to the AbiWord wiki page and you will find more.
OpenOffice uses a different approach. Basically, it keeps a list of paragraphs, and inside each paragraph it keeps a string (or another more efficient data structure) and a list of attributes. I prefer this approach because a paragraph is a naturally small enough unit to edit; it's much easier than a tree-based piece table.
SGI STL has a Rope class, you may want to check it out:
http://www.sgi.com/tech/stl/Rope.html
Using std::list<char> would require about nine times more storage per character than using std::string. That's probably not a good tradeoff. My first inclination would be to use a std::vector<std::string>, where each string object holds the text of a paragraph. Insertions and deletions within a paragraph will be fast enough.
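A minimal sketch of that paragraph-per-string layout; the class and method names are illustrative only:

#include <cstddef>
#include <string>
#include <vector>

// Text held as a sequence of paragraphs; an edit only shifts the
// characters of the one paragraph it touches.
class Document {
    std::vector<std::string> paragraphs;
public:
    void addParagraph(const std::string& text) {
        paragraphs.push_back(text);
    }
    // Insert a character within one paragraph; other paragraphs
    // are untouched, so the cost is bounded by paragraph length.
    void insertChar(std::size_t para, std::size_t pos, char c) {
        paragraphs[para].insert(paragraphs[para].begin() + pos, c);
    }
    void eraseChar(std::size_t para, std::size_t pos) {
        paragraphs[para].erase(paragraphs[para].begin() + pos);
    }
};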