I'm translating R code to C++ and I'd like to find an equivalent (optimal) structure that would allow the same kind of operations as a data frame, but in C++.
The operations are :
add elements (rows)
remove elements (rows) from index
get the index of the lowest value
e.g. :
a <- data.frame(i = c(4, 9, 3, 1, 8, 2, 7, 10, 6, 6),
j = c(8, 8, 8, 4, 3, 9, 1, 4, 8, 9) ,
v = c(1.9, 18, 1.3, 17, 1.5, 14, 11, 1.4, 18, 2.0),
o = c(3, 3, 3, 3, 1, 2, 1, 2, 3, 3))
a[which.min(a$v), c('i', 'j')] # find lowest v value and get i,j value
a <- a[-which.min(a$v), ] # remove that row by index
a <- rbind(a, data.frame(i = 3, j = 9, v = 2, o = 2)) # add a row
As I'm using Rcpp, Rcpp::DataFrame might be an option (though I don't know how I would replicate which.min with it), but I guess it's quite slow for the task, as these operations need to be repeated a lot and I don't need to ship the result back to R.
EDIT:
Target. Just to make it clear, the goal here is to gain speed. That is the obvious reason why one would translate code from R to C++ (there might be others; that's why I clarify). Maintenance and ease of implementation come second.
More precision on the operations. The algorithm is: add lots of data to the array (multiple lines), then extract the lowest value and delete it. Repeat.
That's why I wouldn't go for a sorted vector, but would instead always search for the lowest value on demand, since the array is updated (additions) frequently. I think that's faster, but maybe I'm wrong.
I think a vector of vectors should do what you want. You would need to implement the min-finding manually (two nested loops), which is the fastest you can do without adding overhead.
You can speed up the min-finding by keeping track of the position of the smallest element in each row along with the row.
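As a sketch of the manual min search (the {i, j, v, o} row layout with v at index 2 is my own assumption for illustration):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical row layout: {i, j, v, o}, so the v column sits at index 2.
std::size_t which_min_v(const std::vector<std::vector<double>>& rows) {
    std::size_t best = 0;
    for (std::size_t r = 1; r < rows.size(); ++r)
        if (rows[r][2] < rows[best][2])
            best = r;
    return best;
}
```

This is a single linear pass; the "two nested loops" only appear if each row caches its own column minimum.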
This question is a bit stale, but I thought I would offer some general observations pertaining to this kind of task.
If you are keeping the collection of rows in an ordered state, which might be an assumption of your which.min strategy, the most difficult operation to support efficiently is row insert, if this is a common operation. You'd be hard pressed not to use a list<> data structure, with the likely consequence that which.min turns into a linear operation, since lists aren't great for bisection search.
If you keep an unordered collection, you can deal with deletes by copying records off the end of the frame to the row vacated by the deletion, and subtracting 1 from your row count. Alternatively, you can just flag deletions with another vector of bool, until the delete count hits a threshold such as sqrt(N), at which time you perform a coalescence copy. You'll come out better than amortized O(N^2) for insert/delete, but which.min will be a linear search through the entire vector each time.
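The swap-with-the-last-row deletion can be sketched like this (a hypothetical helper, not from the original code):

```cpp
#include <cstddef>
#include <vector>

// O(1) deletion from an unordered collection: copy the last row into the
// vacated slot, then shrink by one. Row order is not preserved.
template <typename Row>
void erase_unordered(std::vector<Row>& rows, std::size_t i) {
    rows[i] = rows.back();
    rows.pop_back();
}
```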
The normal thing to do when you need to identify the min/max element as a common operation is to employ a priority queue of some kind, sometimes duplicating the data for the column indexed. Off the top of my head, it would be tricky to synchronize a priority queue over one data column with rows of a data frame that are moving around as a result of delete operations in the non-list implementation.
If the rows are merely marked as deleted, the priority queue would stay in sync (though you would have to discard elements popped off the queue corresponding to subsequently deleted rows until you get a good one); after the coalescence copy, you would re-index the priority queue, which is pretty fast if you're not doing it too often. Usually if you had enough memory to grow the structure to a large size, you're not overly pressed to give the memory back again if the structure shrinks; it's not obvious you ever need to coalesce if your structure tends to persist at a size anywhere near the high water mark, but beware of the case where your priority queue has both expired and fresh references to the same storage row, because you wrote new data to a row previously deleted. For efficiency, sometimes you end up using an auxiliary list to keep track of rows marked deleted so you can find storage for inserted rows in less than linear time.
It can be hard to extract stale items from the bowels of a priority queue, since these tend to be designed only for removal at the top of the queue; often you have to leave the stale items in there and arrange to ignore them if they surface at some later time.
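A minimal sketch of that lazy-deletion scheme, under the simplifying assumption that deleted rows are never reused (which sidesteps the stale-versus-fresh collision noted above):

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Min-heap over (value, row) pairs plus a deletion-flag vector; stale heap
// entries for deleted rows are simply discarded when they surface at the top.
struct MinTracker {
    std::priority_queue<std::pair<double, std::size_t>,
                        std::vector<std::pair<double, std::size_t>>,
                        std::greater<>> heap;
    std::vector<bool> deleted;

    std::size_t add(double v) {
        deleted.push_back(false);
        std::size_t row = deleted.size() - 1;
        heap.push({v, row});
        return row;
    }

    // Pop the smallest live row; assumes at least one live row remains.
    std::size_t pop_min() {
        while (deleted[heap.top().second]) heap.pop();  // skip stale entries
        std::size_t row = heap.top().second;
        heap.pop();
        deleted[row] = true;
        return row;
    }
};
```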
When you get into C++ with performance objectives, there are many ways to skin the cat, and you need to be far more precise about performance trade-offs than what the original R code expressed to obtain good execution time for all required operations.
A data.frame is really just a list of vectors. In C++, we really only have those lists of vectors, which makes adding rows hard.
The same goes for removing rows -- as Rcpp works on the original R representation, you always need to copy all the remaining values.
As for a which.min() equivalent: I think that came up once on the mailing list, and you can do something simple with STL idioms. I don't recall us having that in the API.
An R data frame in C++ terms is a container of objects (An R matrix might be a vector of vectors, but if you care about efficiency you are unlikely to implement it that way.)
So, represent the data frame with this class:
class A {
public:
    int i, j, o;
    double v;
    A(int i_, int j_, double v_, int o_) : i(i_), j(j_), v(v_), o(o_) {}
};
And prepare this algorithm parameter function to help find the minimum:
bool comp(const A &x, const A &y){
    return x.v < y.v;
}
(In production code I'd more likely use a functor object (see Meyers' Effective STL, item 46), or a Boost lambda, or, best of all, C++0x's lambdas.)
And then this code body:
std::vector<A> a;
a.push_back(A(4,8,1.9,3));
a.push_back(A(9,8,18,3));
a.push_back(A(3,8,1.3,3));
//...
std::vector<A>::iterator lowest=std::min_element(a.begin(),a.end(),comp);
std::cout<< lowest->i << ',' << lowest->j <<"\n";
a.erase(lowest);
a.push_back( A(3,9,2,2) );
Depending on what you are really doing, it may be more efficient to sort a first, greatest first. Then if you wish to erase the lowest item(s) you simply truncate the vector.
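A sketch of that sorted-descending approach (Row is a stand-in for the class A above):

```cpp
#include <algorithm>
#include <vector>

// Minimal stand-in for the class A defined earlier.
struct Row { int i, j, o; double v; };

// Keep the vector sorted greatest-v first; the minimum then sits at the
// back, and erasing it is an O(1) pop_back instead of an O(n) erase.
void sort_greatest_first(std::vector<Row>& a) {
    std::sort(a.begin(), a.end(),
              [](const Row& x, const Row& y) { return x.v > y.v; });
}
```

The trade-off is that every insertion must now maintain the ordering, which is why the question's frequent-insertion workload may still favor the unsorted scan.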
If you are actually deleting all over the place, and which.min() was just for the sake of example, you may find a linked list more efficient.
No, a data.frame is a bit more complex than a vector of vectors.
I'd say the simplest design for speed in all cases is to store each column in a typed vector and create a list as a header for the rows, then build an intrusive list on top of it.
Related
I have a vector of vectors in C++, defined with: vector< vector<double> > A;
Let's suppose that A has been filled with some values. Is there a quick way to extract a row vector from A?
For instance, A[0] will give me the first column vector, but how can I quickly get the first row vector?
There is no "quick" way with that data structure, you have to iterate each column vector and get the value for desired row and add it to temporary row vector. Wether this is fast enough for you or not depends on what you need. To make it as fast as possible, be sure to allocate right amount of space in the target row vector, so it doesn't need to be resized while you add the values to it.
Simple solution to performance problem is to use some existing matrix library, such as Eigen suggested in comments.
If you need to do this yourself (because it is assignment, or because of licensing issues, or whatever), you should probably create your own "Matrix 2D" class, and hide implementation details in it. Then depending on what exactly you need, you can employ tricks like:
have a "cache" for rows, so if same row is fetched many times, it can be fetched from the cache and a new vector does not need to be created
store data both as vector of row vectors, and vector of column vectors, so you can get either rows or columns at constant time, at the cost of using more memory and making changes twice as expensive due to duplication of data
dynamically change the internal representation according to current needs, so you get the fixed memory usage, but need to pay the processing cost when you need to change the internal representation
store data in flat vector with size of rows*columns, and calculate the correct offset in your own code from row and column
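The flat-vector option from the last bullet might look like this (a hypothetical Matrix2D, with the offset computed as row * cols + col):

```cpp
#include <cstddef>
#include <vector>

// Row-major flat storage: element (r, c) lives at index r * cols + c.
struct Matrix2D {
    std::size_t rows, cols;
    std::vector<double> data;

    Matrix2D(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}

    double&       at(std::size_t r, std::size_t c)       { return data[r * cols + c]; }
    const double& at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};
```

This keeps all elements contiguous, so both row and column traversal avoid per-row allocations, though column traversal strides through memory.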
But it bears repeating: someone has already done this for you, so try to use an existing library, if you can...
There is no really fast way to do that. Also as pointed out, I would say that the convention is the other way around, meaning that A[0] is actually the first row, rather than the first column. However even trying to get a column is not really trivial, since
{0, 1, 2, 3, 4}
{0}
{0, 1, 2}
is a very possible vector<vector<double>> A, but there is no real column 1, 2, 3 or 4. If you wish to enforce behavior like same length columns, creating a Matrix class may be a good idea (or using a library).
You could write a function that would return a vector<double> by iterating over the rows, storing the appropriate column value. But you would have to be careful about whether you want to copy or point to the matrix values (vector<double> / vector<double *>). This is not very fast as the values are not next to each other in memory.
The answer is: in your case there is no correspondingly simple option as there is for columns. One of the reasons is that vector<vector<double>> is a particularly poorly suited container for multi-dimensional data.
In multi-dimension it is one of the important design decisions: which dimension you want to access most efficiently, and based on the answer you can define your container (which might be very specialized and complex).
For example in your case: it is entirely up to you to call A[0] a 'column' or a 'row'. You only need to do it consistently (best to define a small interface around it which makes this explicit). But STOP, don't do that:
This brings you to the next level: for multi-dimensional data you would typically not use vector<vector<double>> at all (but this is a different issue). Look at smart and efficient solutions that already exist, e.g. in ublas https://www.boost.org/doc/libs/1_65_1/libs/numeric/ublas/doc/index.html or eigen3 https://eigen.tuxfamily.org/dox/
You will never be able to beat these highly optimized libraries.
Let's say I have a list of integers:
2, 1, 3, 1, 4, 2, 5, 3, 2
I want to be able to insert a new integer at position i. So let's say i is 4, and I want to insert the number 7. The result would be:
2, 1, 3, 7, 1, 4, 2, 5, 3, 2
After the insertion, I would like to receive some information based on numbers at positions i and lower. For example, the sum of the first i numbers. In this case it would be 2 + 1 + 3 + 7 = 13.
I want to be able to repeat this process over and over.
I wrote a program in C++ that uses std::list. Here's what it does to insert n at position i into List and then return the sum of i first numbers:
Compare the last insert position k with i. If it's lower, calculate sum[j] for each j: k < j < i like this: sum[j] = sum[j-1] + List[j] - O(n)
Find position i - O(n)
Insert n at position i, store k = i - O(1)
Calculate and return sum[i] = sum[i-1] + n - O(1)
Can this be done more efficiently, perhaps using a different data structure? In O(logn) maybe? If yes then how?
If you want an out-of-the-box solution without rolling a new data structure or using a third party lib, std::vector would be your best bet. The algorithmic complexity would be:
Compare the last insert position k with i. If it's lower, calculate sum: O(n)
Find position i: O(1) or O(n) if it involves some kind of search. If there's a search involved, it will still be substantially faster than std::list.
Insert n at position i: O(n)
Calculate and return sum[i] = sum[i-1] + n: O(1)
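With std::vector, the whole operation from the question can be sketched as (0-based position, matching the example's result of 13):

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Insert value at (0-based) position pos, then return the sum of the
// elements up to and including it, as in the question's example.
int insert_and_prefix_sum(std::vector<int>& v, std::size_t pos, int value) {
    v.insert(v.begin() + pos, value);                          // O(n) shift
    return std::accumulate(v.begin(), v.begin() + pos + 1, 0); // O(n) scan
}
```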
This might not look better from an algorithmic/scalability standpoint, yet the considerable performance improvement we would typically see here isn't due to algorithmic complexity. It's due to locality of reference (spatial locality in particular).
The machine can plow through contiguous data sequentially very quickly, since multiple adjacent elements can be accessed prior to being evicted from a cache line. std::vector has that going for it in spades, and we end up benefiting from its rapid, contiguous, sequential access for all 4 cases above.
std::list, when used with std::allocator (especially in a context where not all nodes are allocated at once), tends to invoke a lot of cache-misses since it lacks spatial locality (also, in part, due to the overhead of the list pointers which reduces the number of elements that can fit into a cache line, and in this particular case, substantially since we require two list pointers per measly integer).
Note that potentially more optimal solutions exist when venturing outside the standard library which are tuned for your specific problem, as mentioned in the other nice answer. Another angle that delves into lower-level details is to seek your own custom allocator which can really help just about any kind of linked structure. This answer focuses on vanilla C++. There vector is often your best bet (unless given some strong reasons otherwise) when dealing with a sequential container given its contiguous, cache-friendly representation.
As @andyg mentioned in the comments, this is a job for a Fenwick tree, also known as a Binary Indexed Tree. A Binary Indexed Tree can do an update in O(log n) and a query (the sum from the beginning up to an index) in O(log n). There is a very good article about the Binary Indexed Tree here.
This job can also be done with a segment tree, but as the implementation of a Binary Indexed Tree is much simpler, I recommend the Binary Indexed Tree.
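For reference, a minimal Binary Indexed Tree supporting point update and prefix-sum query in O(log n) (supporting true mid-sequence insertion needs an order-statistic variant layered on top of this):

```cpp
#include <vector>

// Minimal Fenwick (Binary Indexed) tree over int values, 1-based indexing.
struct Fenwick {
    std::vector<int> t;

    explicit Fenwick(int n) : t(n + 1, 0) {}

    // Add delta at position i (1-based).
    void add(int i, int delta) {
        for (; i < static_cast<int>(t.size()); i += i & -i) t[i] += delta;
    }

    // Sum of positions 1..i.
    int prefix(int i) const {
        int s = 0;
        for (; i > 0; i -= i & -i) s += t[i];
        return s;
    }
};
```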
I have a sparse matrix class whose non-zeros and corresponding column indices are stored, in row-order, in what are basically STL-vector-like containers. They may have unused capacity, like vectors; and to insert/remove elements, existing elements must be moved.
Say I have an operation, insert_erase_replace, or ier for short. ier can do the following, given a position p, a column index j, and a value v:
if v==0, ier removes the entry at p and left-shifts all subsequent entries.
if v!=0, and j is already present at p, ier replaces the cell contents at p with v.
if v!=0, and j is not present at p, ier inserts the entry v and column index j at p after right-shifting all subsequent entries.
So all of that is trivial.
Now let's say I have ier2, which does the same thing, except that it takes a list containing multiple column indices j and corresponding values v. It also has a size n, which indicates how many index/value pairs are present in the list. But because the vector only stores non-zeros, sometimes the actual insertion size is smaller than n.
Still trivial.
But now let's say I have ier3, which takes not just one list like ier2, but multiple lists. This represents editing a slice of the sparse matrix.
At some point, it becomes more efficient to iterate through the vectors, copying them piece by piece and inserting/replacing/erasing the list indices/values ier2-style as we arrive at each insertion point. And if the total insertion size would cause my vector to need a resize anyway, then we do that.
Given that my vector is much, much larger than the total length of the lists, is there an algorithm for efficiently merging the lists into the vector?
So far, here's what I have:
Each list passed to ier3 represents either a net deletion of entries (a left shift), a net replacement (no movement, therefore cheap), or a net insertion of entries (a right shift). There may also be some re-arrangement of elements in there, but the expensive parts are the net deletions and net insertions.
It's not hard to figure out an algorithm for efficiently doing ONLY net insertions or net deletions.
It's harder when either of the two may be happening.
The only thing I can think to do is to handle it in two passes:
Erase/replace
Insert/replace
We erase first because it makes it more likely that any insertions will require fewer copies.
Is this the right approach? Does anyone know of a better one?
Okay, so I'm going to suppose the intervals covered by each list in ier3 are disjoint and given to you in order. If it's meant for editing slices of a matrix, this seems reasonable. I'm also assuming that you don't need to resize the vector, because that case is easily detectable and solvable.
Initialise a read pointer and a write pointer to the start of the vector you're editing. There'll be an instruction pointer into ier3 too, but I'll ignore that here for clarity's sake. You'll also need a queue. At each step, one of several things can happen:
Default: Neither read nor write are at a position detailed by ier3. In this case, add the element under read to the back of the queue and write the element at the front of the queue to the cell under write. Move both pointers forward one.
read is over a cell that needs to be deleted. In this case, simply move read forward one without adding anything to the queue.
read passes from one cell to the next such that an insertion should happen between them. In this case, add the insertion to the back of the queue and then continue with the next relevant case.
read is at a cell that needs to be modified. In this case, insert the modified cell at the back of the queue, write whatever's at the front of the queue to write, and step them both forwards.
read has arrived at the unused capacity of the vector. In which case just write whatever's left in the queue.
That's the basic outline, but a couple of optimizations can be made: first, if the queue is empty, step both pointers forward to the next position detailed by ier3 without doing anything. Second, minimize the buffer by doing extra writing steps whenever read is ahead of write and the queue is nonempty.
I'd go with your plan with a few important points highlighted.
The erase/replace step should start from the left and only move points within the affected range - it can leave a "gap". It should determine the size of the final vector. At the end of this step, use the determined size to shift the "tail" of the vector as needed, leaving the exact amount of space required for insertions free.
The insertions should start from the right and fill up the gap we left in step 1 by copying each point to its final position.
This will shift the main vector only once and never copy any point (from the existing slice or the insertion set) more than twice, so it's essentially linear.
Other data structures might be helpful too - reserving space at both the front and end, or building it out of multiple sections so a resize doesn't force a full copy.
One further optimisation would be to allow some insertions during step 1. If you've erased some, completing any insertion you come across immediately until it balances will prevent you needing to move any points until you reach another erase.
Let n be the size of the list and m be the size of the vector. It sounds like ier does a binary search for j every time, so the searching part is O(n*log(m)).
Assuming the elements in the list are sorted, once you find the first element, it's faster to just navigate up the vector to find the next one. That way searching becomes O(log(m) + n) = O(n).
Also, do a dry pass first to count net deletions/insertions, and a second pass to actually apply the changes. I think these two passes will run faster than the two you describe.
I can suggest a different design for a sparse matrix that should help you achieve performance and a low memory footprint for large sparse matrices.
Instead of a vector, why not use a 2D hash table? Something like (omitting std:: for smaller code):
typedef unordered_map< unsigned /* index */, int /* value */ > col_type;
unordered_map< unsigned /* index */, col_type* > rows; // may need to define a hash function for col_type
The outer class (sparse_matrix) searches in O(1) for a column. If it is not found, it allocates a new column.
Then the column type is searched for the column index in O(1), and we either delete/replace or insert based on the original logic. If the column becomes empty, it can be deleted from the 'row' hash map.
All basic operations (add/delete/replace) are O(1).
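A sketch of that two-level layout (storing columns by value rather than through a pointer, purely for brevity):

```cpp
#include <unordered_map>

// Two-level hash layout: outer index -> (inner index -> value).
// A value of 0 is treated as "absent", matching the sparse convention.
using col_type = std::unordered_map<unsigned, int>;

struct sparse_matrix {
    std::unordered_map<unsigned, col_type> rows;

    void set(unsigned r, unsigned c, int v) {
        if (v == 0) {                                    // deletion
            auto it = rows.find(r);
            if (it != rows.end()) {
                it->second.erase(c);
                if (it->second.empty()) rows.erase(it);  // drop empty column map
            }
        } else {
            rows[r][c] = v;                              // insert or replace
        }
    }

    int get(unsigned r, unsigned c) const {
        auto it = rows.find(r);
        if (it == rows.end()) return 0;
        auto jt = it->second.find(c);
        return jt == it->second.end() ? 0 : jt->second;
    }
};
```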
If you need fast ordered iteration over the matrix, you can replace the unordered_map with map. If the matrix is very sparse, the O(n log n) complexity will be similar to the hash map's.
BTW, I used a pointer to the col_type on purpose; the outer hash map grows much (much!) faster this way.
What is the standard way to "undo" a popFront operation? I realize that this would not work on all ranges, but for things like arrays, say you had
int[] a = [ 1, 2, 3 ];
And you did a.popFront(), which would adjust the start pointer of a to point at the 2. How would you undo that operation to get back the 1 in the range? I am aware of std.container.insertFront, but that is not the operation I am looking for.
I have tried
a = a[1..$];
a = a[-1..$];
but the second line throws a RangeError. Also, arrays support slicing, but the method I am looking for should support non-random-access ranges and ranges that do not support slicing. So even if a[-1..$] did work, it wouldn't solve my problem.
The standard way would be to save a copy of the range before popping. Popping is a destructive mutation, and the range is free to deallocate the element, rebalance the underlying tree, or otherwise invalidate the previous element.
Thus:
MyRange old = current.save;
current.popFront();
if (current.front == magicValue) {
current = old;
}
You don't undo popFront. You can't even do something equivalent to arr[-1 .. $] with arrays. If you want the old version, you have to save it first.
auto saved = range.save;
range.popFront();
range = saved; // "undo" popFront()
Arrays do not provide any more functionality than that either. To do the same thing with arrays without the range API, you'd have to do something like
auto saved = arr;
arr = arr[1 .. $];
arr = saved;
The only way to "undo" a pop operation on a range or array is to save it first and then use the old version. Nothing else is provided by either the range API or by arrays. They do not save their state on their own (and therefore could not know how to undo a previous operation), and not even array slices have any idea what data may be before or after them in memory (and trying to access the memory before or after an array would be illegal as you saw when you hit a RangeError).
So, if you have to worry about "undoing" the popping off of an arbitrary number of elements, then you're probably going to have to do something like hold onto the original range and keep track of how many elements you've popped off of it, so that you can pop off that number of elements minus the number of levels of "undo" that you want. And while not much copying is likely to be going on here (for arrays, it would just be multiple slices pointing to the same memory, but at different places in it), if you're not dealing with a range with slicing, all of that popping could be expensive (as could holding a saved version of the range from before each element is popped off), especially if you were trying to undo one level at a time. So the range API may not be very well suited to what you're trying to do, and you may have to rethink how you go about it.
I want to keep a hit counter for objects as they are received, showing each object's frequency, and to be able to retrieve the most frequent (top-hit) objects.
An unordered_map fits the first part, with the object as the key and its hit count as the value:
unordered_map<object,int>
It enables fast lookup of an object and incrementing its hits. But what about sorting? A priority_queue gives you the top-hit object, but then how do you increment an arbitrary object's hit count?
I would suggest you have a look at a splay tree, which keeps objects arranged so that the most recently and most frequently accessed ones are closer to the top. This relies on several heuristics and thus will give you an approximation of the perfect solution.
For an exact solution, it is better to implement your own binary heap with an increment-priority operation. In theory the same structure backs priority_queue, but that interface offers no change-priority operation, even though one can be added without affecting the complexity of the data structure's operations.
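A sketch of such a binary heap with an increment-priority operation, using a position map so an arbitrary object's entry can be found in O(1) (names are illustrative):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Max-heap keyed by hit count, plus object -> heap-index map so that
// incrementing an arbitrary object's priority is O(log n).
struct HitHeap {
    std::vector<std::pair<std::string, int>> heap;    // (object, hits)
    std::unordered_map<std::string, std::size_t> pos; // object -> heap index

    void swap_nodes(std::size_t a, std::size_t b) {
        std::swap(heap[a], heap[b]);
        pos[heap[a].first] = a;
        pos[heap[b].first] = b;
    }

    void sift_up(std::size_t i) {
        while (i > 0) {
            std::size_t parent = (i - 1) / 2;
            if (heap[parent].second >= heap[i].second) break;
            swap_nodes(i, parent);
            i = parent;
        }
    }

    // Increment obj's hit count, inserting it with count 1 if unseen.
    void increment(const std::string& obj) {
        auto it = pos.find(obj);
        if (it == pos.end()) {
            heap.push_back({obj, 1});
            pos[obj] = heap.size() - 1;
            sift_up(heap.size() - 1);
        } else {
            heap[it->second].second++;
            sift_up(it->second);
        }
    }

    const std::string& top() const { return heap.front().first; }
};
```

Since increments only ever raise a priority, sift_up alone suffices; a general change-priority would also need a sift-down.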
I managed to solve it by keeping a list of objects sorted by hit count as I insert them, so there is always a list of the N top hits. There are 3,000,000 objects and I want the top 20.
Here are the structures I used:
key_hit to keep track of hits (by key, a string, I mean the object):
unordered_map<string, int> key_hit;
two arrays, hits[N] and keys[N], which contain the top hits and their corresponding keys (objects):
idx, hits, keys
0, 212, x
1, 200, y
...
N, 12, z
and another map key_idx to keep the key and its corresponding index:
unordered_map<string,int> key_idx;
Algorithm (without details):
key is input.
search for the key in key_hit, find its hit count and increment it (this is fast enough).
if hit < hits[N], ignore it.
else, idx = key_idx[key] (if the key is not found, add it to the structures and evict the existing lowest entry; it would take too long to write out all the details).
H = ++hits[idx]
check whether it is now greater than the entry above it, i.e. hits[idx-1] < H. If yes, swap idx and idx-1 in key_idx, hits, and keys.
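The scheme above might be sketched as follows (with a small N and illustrative names; the eviction details are simplified, and the top list is kept as one vector of pairs rather than parallel arrays):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

const std::size_t N = 3;  // illustrative; the question uses N = 20

std::unordered_map<std::string, int> key_hit;        // full hit counter
std::vector<std::pair<std::string, int>> top;        // top-N, hits descending

void record(const std::string& key) {
    int h = ++key_hit[key];
    auto it = std::find_if(top.begin(), top.end(),
                           [&](const auto& p) { return p.first == key; });
    if (it != top.end()) {
        it->second = h;
        // bubble the entry up while it beats the one above it
        while (it != top.begin() && (it - 1)->second < it->second) {
            std::iter_swap(it, it - 1);
            --it;
        }
    } else if (top.size() < N) {
        top.push_back({key, h});
        std::sort(top.begin(), top.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });
    } else if (h > top.back().second) {
        top.back() = {key, h};           // evict the current lowest entry
        auto jt = top.end() - 1;
        while (jt != top.begin() && (jt - 1)->second < jt->second) {
            std::iter_swap(jt, jt - 1);
            --jt;
        }
    }
}
```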
I tried to make it fast, but I don't know how fast it actually is.