How does arma::find_unique determine unique indices? - c++

I am using arma::find_unique, and I thought it returned the index of the first occurrence of each unique value in a vector, but it appears to return something else.
Here is a toy function:
// [[Rcpp::export]]
arma::uvec test(arma::vec& x_) {
arma::vec x = arma::sort(x_);
return arma::find_unique(x);
}
If I run the function in R with a simple vector test(5:1) I get a vector of all the indices 0,1,2,3,4 which makes sense since each value is unique.
If I try something like:
set.seed(1991)
var=sample(1:8,20,TRUE)
test(var)
OUTPUT:
1,3,6,7,19,12,14,18.
All those values make sense except the first one. Why is the first unique value at index 1 and not 0? Clearly I am misunderstanding what arma::find_unique intends to do so I would appreciate if someone could enlighten me.
EDIT
My session information

Okay, the following is courtesy of @nrussell, the man is amazing, and was given in the comments to this "answer." (I do not deserve the check mark nor upvotes.)
Actually, I'm pretty sure this is all just a misinterpretation of the Armadillo documentation, which never actually guarantees that a stable sort is used, as @Carl was expecting. Underneath, std::sort is being called, which is not guaranteed to be a stable sort by the C++ standard; also stated here:
"The order of equal elements is not guaranteed to be preserved."
I can demonstrate this here, replicating the "packet" structure used in Armadillo's algorithm. My guess is that libc++ (typically used by OS X) does implement std::sort as a stable sort, while libstdc++ does not.
My turn: The stable sort, or maintaining the relative order of records with equal keys (i.e. values), is the key issue behind this question. For example, consider the following:
dog car pool dig
Sorting by the first letter with a stable sort gives us:
car dog dig pool
Because the word "dog" appeared prior to "dig" in the vector, it therefore must appear before "dig" in the output.
Sorting by the first letter with an unstable sort gives us either:
car dig dog pool
or
car dog dig pool
The same principle applies to numbers, since a repeated value is a key that is literally present elsewhere in the vector. So, we have:
2, 3, 2, 4
Thus, when the unique values are found:
2, 3, 4
The 2 can be reported with index 0 or with index 2, depending on where the sort placed each copy.
As @nrussell explained, macOS since OS X Mavericks (10.9) relies by default on --stdlib=libc++ vs. the traditional --stdlib=libstdc++ flag for compiling. This was likely the reason why I was unable to replicate it as one implementation opts for stability while the other does not.
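To make the stability point concrete, here is a minimal sketch in plain C++ (no Armadillo). The function name first_unique_indices and the Packet alias are made up for illustration, and this is not Armadillo's actual implementation; it only mimics the idea of sorting (value, index) packets. Because it uses std::stable_sort, the first packet in each run of equal values always carries the smallest original index.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical "packet": a value together with its original position.
using Packet = std::pair<double, std::size_t>;

// Hypothetical helper (not part of Armadillo): returns the index of the first
// occurrence of each distinct value, ordered by increasing value.
std::vector<std::size_t> first_unique_indices(const std::vector<double>& x) {
    std::vector<Packet> packets;
    packets.reserve(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        packets.emplace_back(x[i], i);
    }

    // Compare by value only. std::stable_sort keeps equal values in their
    // original relative order; std::sort gives no such guarantee.
    std::stable_sort(packets.begin(), packets.end(),
                     [](const Packet& a, const Packet& b) { return a.first < b.first; });

    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < packets.size(); ++i) {
        // The first packet of each run of equal values has the smallest index.
        if (i == 0 || packets[i].first != packets[i - 1].first) {
            out.push_back(packets[i].second);
        }
    }
    return out;
}
```

For the vector 2, 3, 2, 4 this yields the indices 0, 1, 3. Swapping std::stable_sort for std::sort would make the index reported for the value 2 implementation-dependent.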
Original Answer
First, I'm not able to replicate this on macOS... (See end)
It seems as if we are able to repro this on Linux though (@nrussell), which means that at some point there is an issue in the linked code.
Secondly, arma::find_unique is implemented here using matrix ops with op_find_unique. The latter is the key, as it implements the comparators.
Thus, in short, there should be no way that this is possible, given that you sort the vector and the first item is always considered to be unique.
Test function
#include <RcppArmadillo.h>
using namespace Rcpp;
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::uvec test(arma::vec& x_) {
Rcpp::Rcout << "Input:" << x_.t() << std::endl;
arma::vec x = arma::sort(x_);
Rcpp::Rcout << "Sorted:" << x.t() << std::endl;
arma::uvec o = arma::find_unique(x);
Rcpp::Rcout << "Indices:" << o.t() << std::endl;
return o;
}
/*** R
set.seed(1991)
(v=sample(1:8,20,TRUE))
## [1] 2 2 1 5 7 6 7 6 4 1 5 3 1 4 4 2 8 7 7 8
sort(v)
## [1] 1 1 1 2 2 2 3 4 4 4 5 5 6 6 7 7 7 7 8 8
test(v)
### Received
## 2.0000 2.0000 1.0000 5.0000 7.0000 6.0000 7.0000 6.0000 4.0000 1.0000 5.0000 3.0000 1.0000 4.0000 4.0000 2.0000 8.0000 7.0000 7.0000 8.0000
### Sorted
## 1.0000 1.0000 1.0000 2.0000 2.0000 2.0000 3.0000 4.0000 4.0000 4.0000 5.0000 5.0000 6.0000 6.0000 7.0000 7.0000 7.0000 7.0000 8.0000 8.0000
### Output
## 0 3 6 7 10 12 14 18
*/

Related

Ignore NaNs in mean and other stats functions using Armadillo

If a matrix contains NaN values, Armadillo will return NaN for stats performed on the columns/rows containing these NaNs. I.e. the following code
arma::mat A = {{1, 2, 3}, {6, 7, 8}, {4, 9, 10}};
A(1,1) = arma::datum::nan;
std::cout << A << "\n";
std::cout << arma::mean(A) << "\n" << arma::mean(A, 1);
will return
1.0000 2.0000 3.0000
6.0000 nan 8.0000
4.0000 9.0000 10.0000
3.6667 nan 7.0000
2.0000
nan
7.6667
Is there an efficient way to ignore the NaN values much like MATLAB's nanmean() / mean(-, 'omitnan')?
The column-wise mean would then return 5.5, the row-wise mean 7 instead of NaN.
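As a rough illustration of the behaviour being asked for, here is a sketch in plain C++ rather than Armadillo. The helper name nan_mean is made up, and it works on a std::vector<double> holding one row or one column copied out of the matrix; it simply skips NaN entries when averaging.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: mean of the non-NaN entries of one row or column.
// Returns NaN when every entry is NaN (or the input is empty).
double nan_mean(const std::vector<double>& values) {
    double sum = 0.0;
    std::size_t count = 0;
    for (double v : values) {
        if (!std::isnan(v)) {  // skip missing values
            sum += v;
            ++count;
        }
    }
    if (count == 0) {
        return std::numeric_limits<double>::quiet_NaN();
    }
    return sum / static_cast<double>(count);
}
```

Applied to the second column of A (2, NaN, 9) this gives 5.5, and applied to the second row (6, NaN, 8) it gives 7, matching the values expected above.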

Arrayfire sparse matrix issues

Getting confused with something that should be simple. Spent a bit of time trying to debug this and am not getting too far. Would appreciate if someone could help me out.
I am trying to define a sparse matrix in arrayfire by specifying the value/column/row triples as specified in this function. I want to store the following matrix as sparse:
3 3 4
3 10 0
4 0 3
I code it up as follows:
int row[] = {0,0,0,1,1,2,2};
int col[] = {0,1,2,0,1,0,2};
double values[] = { 3,3, 4,3,10,4,3};
array rr = sparse(3,3,array(7,values),array(7,row),array(7,col));
af_print(rr);
af_print(dense(rr));
I get the following output:
rr
Storage Format : AF_STORAGE_CSR
[3 3 1 1]
rr: Values
[7 1 1 1]
1.0000
2.0000
4.0000
3.0000
10.0000
4.0000
3.0000
rr: RowIdx
[7 1 1 1]
0
0
0
1
1
2
2
rr: ColIdx
[7 1 1 1]
0
1
2
0
1
0
2
dense(rr)
[3 3 1 1]
0.0000 0.0000 0.0000
0.0000 0.0000 3.0000
3.0000 0.0000 0.0000
When printing out the stored matrix in dense format, I get something completely different from what I intended.
How do I make the output of printing the dense version of rr give:
3 3 4
3 10 0
4 0 3
ArrayFire uses (a modified) CSR format, so the row array has to be of length number_of_rows + 1. Normally it would be filled with the number of non-zero entries per row, i.e. {0, 3, 2, 2}. But for ArrayFire, you need to take the cumulative sum, i.e. {0, 3, 5, 7}. So this works for me:
int row[] = {0,3,5,7};
int col[] = {0,1,2,0,1,0,2};
float values[] = {3,3,4,3,10,4,3};
array rr = sparse(3,3,array(7,values),array(4,row),array(7,col));
af_print(rr);
af_print(dense(rr));
However, this is not really convenient, since it is quite different from your input format. As an alternative, you could specify the COO format:
int row[] = {0,0,0,1,1,2,2};
int col[] = {0,1,2,0,1,0,2};
float values[] = { 3,3, 4,3,10,4,3};
array rr = sparse(3,3,array(7,values),array(7,row),array(7,col), AF_STORAGE_COO);
af_print(rr);
af_print(dense(rr));
which produces:
rr
Storage Format : AF_STORAGE_COO
[3 3 1 1]
rr: Values
[7 1 1 1]
3.0000
3.0000
4.0000
3.0000
10.0000
4.0000
3.0000
rr: RowIdx
[7 1 1 1]
0
0
0
1
1
2
2
rr: ColIdx
[7 1 1 1]
0
1
2
0
1
0
2
dense(rr)
[3 3 1 1]
3.0000 3.0000 4.0000
3.0000 10.0000 0.0000
4.0000 0.0000 3.0000
See also https://github.com/arrayfire/arrayfire/issues/2134.
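If you prefer to keep the triplet input but still want the CSR variant, the row array can be derived from the per-entry row indices with a count followed by a cumulative sum. The following is a minimal sketch in plain C++; csr_row_offsets is a hypothetical helper, not an ArrayFire function, and it assumes the values and column arrays are already ordered by row.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: convert per-entry row indices into CSR row offsets
// of length n_rows + 1. Assumes every index satisfies 0 <= r < n_rows.
std::vector<int> csr_row_offsets(const std::vector<int>& rows, int n_rows) {
    std::vector<int> offsets(static_cast<std::size_t>(n_rows) + 1, 0);

    // Count the non-zero entries of each row, shifted by one slot,
    // which gives {0, 3, 2, 2} for the example matrix.
    for (int r : rows) {
        ++offsets[static_cast<std::size_t>(r) + 1];
    }

    // Cumulative sum turns the counts into offsets: {0, 3, 5, 7}.
    for (std::size_t i = 1; i < offsets.size(); ++i) {
        offsets[i] += offsets[i - 1];
    }
    return offsets;
}
```

Calling csr_row_offsets({0, 0, 0, 1, 1, 2, 2}, 3) returns {0, 3, 5, 7}, which is the row array used in the CSR example above.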

Why is infix not properly extracting the correct columns from my fixed-width text file?

I have a fixed-width text file named "OASH2010.txt" that looks like the following
201081127501F H 22 99920 0 0 13860921 0 1.0000
201081127501F H 23 99930 0 0 410026345 0 1.0000
201081129301F H 1 71131 27 51602 1268275327 24578.03 1.0000
201081129301F H 12 99901 0 0 1268275327 0 1.0000
201081129301F H 13 99203 0 0 415264 0 1.0000
201081129301F H 16 99905 28 5798406 14206094 2.45 1.0000
201081129301F H 17 99906 0 0 23261260 0 1.0000
201081129301F H 18 99907 27 4210 27357876 6498.31 1.0000
201081129301F H 20 99204 0 0 12470 0 1.0000
201081129301F H 21 99220 0 0 4044298 0 1.0000
The columns in the file can be extracted based on the character location provided in a README file. For example, the year variable is in the first 4 characters, the observation ID is in the 5th to 12th characters, and so on and so forth.
In order to extract the columns from the text file, I run the following code
#delimit ;
clear;
infix
str YEAR 1-4
str FACT_ID 5-12
str BLK 13-13
str H_I1 14-15
str H_I3 16-20
str H_I4 21-23
str H_I5 24-39
str H_I6 40-54
str H_I7 55-69
str MULT 70-78
using "OASH2010.txt";
From my understanding, infix should ignore the spaces and search until it encounters the next character. This code was not mine originally so presumably the person who wrote it was able to extract the columns from the data. However, it doesn't properly identify and extract the correct columns. I get the following in Stata
Any ideas on why this is happening? Suggestions for how to fix it?
"From my understanding, infix should ignore the spaces and search until it encounters the next character."
Not my understanding: infix should respect the instructions on columns that you give it; that's the point! Else you need another command to do what you want.
I can't speak to the presumption that this worked for someone else. The evidence is that the commands you give don't match the file you show.
What seems evident is the first three variables are contiguous in the data file. Thereafter, the convention switches to space separation. It's a mess.
In a situation like this, I don't agonise about how to input. I read everything in as one string variable and then write custom code.
clear
infix str DATA 1-80 using OASH2010.txt
gen YEAR = real(substr(DATA, 1, 4))
gen FACT_ID = substr(DATA, 5, 8)
gen BLK = substr(DATA, 13, 1)
replace DATA = substr(DATA, 14, .)
split DATA, gen(H_I) destring
rename H_I8 MULT
It seems obvious that YEAR can be made numeric. Conversely, some of the other variables seem to be identifiers of some kind and, if so, are better left as strings. Using destring as an option on split is literally optional; a selective destring on what you know should really be numeric is a good strategy.

Split Data knowing its common ID

I want to split this data,
ID x y
1 2.5 3.5
1 85.1 74.1
2 2.6 3.4
2 86.0 69.8
3 25.8 32.9
3 84.4 68.2
4 2.8 3.2
4 24.1 31.8
4 83.2 67.4
I was able to match each row with its partner, like this:
ID x y ID x y
1 2.5 3.5 1 85.1 74.1
2 2.6 3.4 2 86.0 69.8
3 25.8 32.9
4 24.1 31.8
However, as you can see, some of the rows for ID 4 were placed incorrectly, because they were simply appended in the next few rows. I want to split them properly without resorting to the complex logic I am already using. Can someone give me an algorithm or an idea?
It should look like this:
ID x    y      ID x    y      ID x    y
1  2.5  3.5    1  85.1 74.1   3  25.8 32.9
2  2.6  3.4    2  86.0 69.8   4  24.1 31.8
4  2.8  3.2    3  84.4 68.2
               4  83.2 67.4
It seems that your question is really about clustering, and that the ID column has nothing to do with determining which points belong together.
A common algorithm to achieve that would be k-means clustering. However, your question implies that you don't know the number of clusters in advance. This complicates matters, and there have been already a lot of questions asked here on StackOverflow regarding this issue:
Kmeans without knowing the number of clusters?
compute clustersize automatically for kmeans
How do I determine k when using k-means clustering?
How to optimal K in K - Means Algorithm
K-Means Algorithm
Unfortunately, there is no "right" solution for this. Two clusters in one specific problem could be indeed considered as one cluster in another problem. This is why you'll have to decide that for yourself.
Nevertheless, if you're looking for something simple (and probably inaccurate), you can use Euclidean distance as a measure. Compute the distances between points (e.g. using pdist), and group points where the distance falls below a certain threshold.
Example
%// Sample input
A = [1, 2.5, 3.5;
1, 85.1, 74.1;
2, 2.6, 3.4;
2, 86.0, 69.8;
3, 25.8, 32.9;
3, 84.4, 68.2;
4, 2.8, 3.2;
4, 24.1, 31.8;
4, 83.2, 67.4];
%// Cluster points
pairs = nchoosek(1:size(A, 1), 2); %// Rows of pairs
d = sqrt(sum((A(pairs(:, 1), :) - A(pairs(:, 2), :)) .^ 2, 2)); %// d = pdist(A)
thr = d < 10; %// Distances below threshold
kk = 1;
idx = 1:size(A, 1);
C = cell(size(idx)); %// Preallocate memory
while any(idx)
x = unique(pairs(pairs(:, 1) == find(idx, 1) & thr, :));
C{kk} = A(x, :);
idx(x) = 0; %// Remove indices from list
kk = kk + 1;
end
C = C(~cellfun(@isempty, C)); %// Remove empty cells
The result is a cell array C, each cell representing a cluster:
C{1} =
1.0000 2.5000 3.5000
2.0000 2.6000 3.4000
4.0000 2.8000 3.2000
C{2} =
1.0000 85.1000 74.1000
2.0000 86.0000 69.8000
3.0000 84.4000 68.2000
4.0000 83.2000 67.4000
C{3} =
3.0000 25.8000 32.9000
4.0000 24.1000 31.8000
Note that this simple approach has the flaw of restricting the cluster radius to the threshold. However, you wanted a simple solution, so bear in mind that it gets complicated as you add more "clustering logic" to the algorithm.

No O(1) operation to join elements from two forward_lists?

When reading about forward_list in the FCD of C++11 and N2543 I stumbled over one specific overload of splice_after (slightly simplified and let cit be const_iterator):
void splice_after(cit pos, forward_list<T>& x, cit first, cit last);
The behavior is that after pos everything between (first,last) is moved to this. Thus:
this: 1 2 3 4 5 6     x: 11 12 13 14 15 16
        ^pos                ^first   ^last
will become:
this: 1 2 13 14 3 4 5 6     x: 11 12 15 16
        ^pos                      ^first
                                     ^last
The description includes the complexity:
Complexity: O(distance(first, last))
I can see that this is because one needs to adjust PREDECESSOR(last).next = pos.next, and the forward_list does not allow this to happen in O(1).
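For reference, the example above can be reproduced with the standard std::forward_list::splice_after range overload. This is only a small sketch; the wrapper function splice_demo and the variable name self (standing in for "this") are made up for illustration.

```cpp
#include <forward_list>
#include <iterator>
#include <utility>

// Hypothetical demo: returns {this, x} after the range splice shown above.
std::pair<std::forward_list<int>, std::forward_list<int>> splice_demo() {
    std::forward_list<int> self{1, 2, 3, 4, 5, 6};
    std::forward_list<int> x{11, 12, 13, 14, 15, 16};

    auto pos = std::next(self.begin(), 1);  // points at 2
    auto first = std::next(x.begin(), 1);   // points at 12
    auto last = std::next(x.begin(), 4);    // points at 15

    // Moves the open range (first, last), i.e. 13 and 14, to just after pos.
    self.splice_after(pos, x, first, last);

    return {std::move(self), std::move(x)};
}
```

After the call, the first list holds 1 2 13 14 3 4 5 6 and the second holds 11 12 15 16.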
Ok, but isn't joining two singly linked lists in O(1) one of the strengths of this simple data structure? Therefore I wonder -- is there no operation on forward_list that splices/merges/joins an arbitrary number of elements in O(1)?
The algorithm would be quite simple, of course. One would just need a name for the operation (pseudocode): (Updated by integrating Kerrek's answer)
temp_this = pos.next;
temp_that = last.next;
pos.next = first.next;
last.next = temp_this;
first.next = temp_that;
The result is a bit different, because the range that is moved is (first,last] rather than (first,last).
this: 1 2 3 4 5 6 7     x: 11 12 13 14 15 16 17
        ^pos                  ^first      ^last
will become:
this: 1 2 13 14 15 16 3 4 5 6 7     x: 11 12 17
        ^pos       ^last                  ^first
I would think this operation is just as reasonable as the former one, and one that people might like to use, especially if it has the benefit of being O(1).
Am I overlooking an operation that is O(1) on many elements?
Or is my assumption wrong that (first,last] might be useful as the moved range?
Or is there an error in the O(1) algorithm?
Let me first give a corrected version of your O(1) splicing algorithm, with an example:
temp_this = pos.next;
temp_that = last.next;
pos.next = first.next;
last.next = temp_this;
first.next = temp_that;
(A sanity check is to observe that every variable appears precisely twice, once set and once got.)
Example:
    pos.next                      last.next
    v                             v
1 2 3 4 5 6 7   11 12 13 14 15 16 17 #
  ^                ^           ^     ^
  pos              first       last  end
becomes:
This: 1 2 13 14 15 16 3 4 5 6 7
That: 11 12 17
Now we see that in order to splice up to the end of that list, we need to provide an iterator to one before the end(). However, no such iterator exists in constant time. So basically the linear cost comes from discovering the final iterator, one way or another: Either you precompute it in O(n) time and use your algorithm, or you just splice one-by-one, also in linear time.
(Presumably you could implement your own singly-linked list that would store an additional iterator for before_end, which you'd have to keep updated during the relevant operations.)
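As a sketch of what such a hand-rolled list operation could look like, here is the five-line relinking written against a bare node type. Node and splice_after_closed are hypothetical names; std::forward_list does not expose its nodes, so this is not something you can do with the standard container. It assumes last is a real node reachable from first, and that first != last.

```cpp
#include <cassert>

// Hypothetical minimal singly-linked node.
struct Node {
    int value;
    Node* next;
};

// Moves the half-open range (first, last] to just after pos in O(1).
// Assumption: last is reachable from first, first != last, and last is a
// real node (not a one-past-the-end sentinel).
void splice_after_closed(Node* pos, Node* first, Node* last) {
    assert(pos != nullptr && first != nullptr && last != nullptr);
    Node* temp_this = pos->next;   // remainder of the destination list
    Node* temp_that = last->next;  // remainder of the source list
    pos->next = first->next;       // hook the moved range in after pos
    last->next = temp_this;        // reattach the destination remainder
    first->next = temp_that;       // close the gap in the source list
}
```

With pos at 2, first at 12 and last at 16, as in the example above, the destination becomes 1 2 13 14 15 16 3 4 5 6 7 and the source becomes 11 12 17. The constant cost depends on the caller already holding a pointer to last.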
There was considerable debate within the LWG over this issue. See LWG 897 for some of the documentation of this issue.
Your algorithm fails when you pass in end() as last because it will try to use the one-past-end node and relink it into the other list. It would be a strange exception to allow end() to be used in every algorithm except this one.
Also I think first.next = &last; needs to be first.next = last.next; because otherwise last will be in both lists.