Constructing a data frame in Rcpp - C++

I want to construct a data frame in an Rcpp function, but when I get it back in R, it doesn't really look like a data frame. I've tried pushing vectors etc., but that leads to the same thing. Consider:
RcppExport SEXP makeDataFrame(SEXP in) {
    Rcpp::DataFrame dfin(in);
    Rcpp::DataFrame dfout;
    for (int i = 0; i < dfin.length(); i++) {
        dfout.push_back(dfin(i));
    }
    return dfout;
}
In R:
> .Call("makeDataFrame", mtcars, PACKAGE = "myPkg")
[[1]]
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
[[2]]
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
[[3]]
[1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0
[25] 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
[[4]]
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52
[20] 65 97 150 150 245 175 66 91 113 264 175 335 109
[[5]]
[1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 3.07 2.93
[16] 3.00 3.23 4.08 4.93 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43 3.77 4.22 3.62
[31] 3.54 4.11
[[6]]
[1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070
[13] 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840
[25] 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780
[[7]]
[1] 16.46 17.02 18.61 19.44 17.02 20.22 15.84 20.00 22.90 18.30 18.90 17.40
[13] 17.60 18.00 17.98 17.82 17.42 19.47 18.52 19.90 20.01 16.87 17.30 15.41
[25] 17.05 18.90 16.70 16.90 14.50 15.50 14.60 18.60
[[8]]
[1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1
[[9]]
[1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1
[[10]]
[1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
[[11]]
[1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2

Briefly:
DataFrames are indeed just like lists, with the added restriction that all columns must have a common length, so they are best constructed column by column.
A good place to look is our unit tests: the file inst/unitTests/runit.DataFrame.R groups the tests for the DataFrame class.
You also found the .push_back() member function in Rcpp, which we added for convenience and for analogy with the STL. We do warn that it is not recommended: because of differences in how R objects are constructed, we essentially always need to make full copies, so .push_back() is not very efficient.
Despite me answering here frequently, the rcpp-devel list is a better place for Rcpp questions.
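For the recommended column-by-column construction, here is a minimal sketch using Rcpp::DataFrame::create (the function name, column names, and values are made up for illustration):
#include <Rcpp.h>

// Build a two-column data.frame directly from named vectors.
RcppExport SEXP makeSmallDataFrame() {
    Rcpp::NumericVector   x = Rcpp::NumericVector::create(1.0, 2.0, 3.0);
    Rcpp::CharacterVector y = Rcpp::CharacterVector::create("a", "b", "c");
    return Rcpp::DataFrame::create(Rcpp::Named("x") = x,
                                   Rcpp::Named("y") = y);
}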

It seems Rcpp can return a proper data.frame, provided you supply the names explicitly. I'm not sure how to adapt this to your example with arbitrary names:
mkdf <- '
    Rcpp::DataFrame dfin(input);
    Rcpp::DataFrame dfout;
    for (int i = 0; i < dfin.length(); i++) {
        dfout.push_back(dfin(i));
    }
    return Rcpp::DataFrame::create(Named("x") = dfout(1), Named("y") = dfout(2));
'
library(inline)
test <- cxxfunction(signature(input = "data.frame"), mkdf, plugin = "Rcpp")
test(input = head(iris))

Using the information from @baptiste's answer, this is what finally gives a well-formed data frame:
RcppExport SEXP makeDataFrame(SEXP in) {
    Rcpp::DataFrame dfin(in);
    Rcpp::DataFrame dfout;
    Rcpp::CharacterVector namevec;
    std::string namestem = "Column Heading ";
    for (int i = 0; i < 2; i++) {
        dfout.push_back(dfin(i));
        namevec.push_back(namestem + std::string(1, (char)(((int)'a') + i)));
    }
    dfout.attr("names") = namevec;
    Rcpp::DataFrame x;
    Rcpp::Language call("as.data.frame", dfout);
    x = call.eval();
    return x;
}
I think the point remains that this might be inefficient due to push_back (as suggested by @Dirk) and the second Language call evaluation. I looked through the Rcpp unit tests and haven't been able to come up with anything better yet. Anybody have any ideas?
Update:
Using @Dirk's suggestions (thanks!), this seems to be a simpler, efficient solution:
RcppExport SEXP makeDataFrame(SEXP in) {
    Rcpp::DataFrame dfin(in);
    Rcpp::List myList(dfin.length());
    Rcpp::CharacterVector namevec;
    std::string namestem = "Column Heading ";
    for (int i = 0; i < dfin.length(); i++) {
        myList[i] = dfin(i);  // adding vectors
        namevec.push_back(namestem + std::string(1, (char)(((int)'a') + i)));  // making up column names
    }
    myList.attr("names") = namevec;
    Rcpp::DataFrame dfout(myList);
    return dfout;
}
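For completeness, a sketch of calling the finished routine from R once it is compiled into a package (the package name is illustrative, as in the question):
out <- .Call("makeDataFrame", mtcars, PACKAGE = "myPkg")
str(out)  # a proper data.frame, with the generated "Column Heading ..." names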

I concur with joran. The output of a C function called from within R is a list of all its arguments, both "in" and "out", so each "column" of the dataframe could be represented in the C function call as an argument. Once the result of the C function call is in R, all that remains to be done is to extract those list elements using list indexing and give them the appropriate names.
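A sketch of that approach (.C() really does return a named list of all its arguments; the routine name fill_columns is hypothetical):
n <- 10L
# "fill_columns" stands in for a C routine that would populate x and y in place;
# .C returns a named list of every argument, which we then repackage.
res <- .C("fill_columns", n = n, x = double(n), y = double(n))
df <- data.frame(x = res$x, y = res$y)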

Related

Why does my Rcpp function sometimes return NaN and other times the correct answer for the same inputs?

I am writing an Rcpp function to multiply a banded matrix by a vector, with an R wrapper function that makes it easier to call the C++ function. I get the same issue regardless of whether I run the C++ function directly or through the R wrapper function.
library(Rcpp)
##########params#############
#p: lower bandwidth
#q: upper bandwidth
#A: data matrix of size n*m
#x: vector to multiply by
#n: number of rows
#m: number of columns
cppFunction("NumericVector bandMatVecMult(int p, int q, NumericMatrix A, NumericVector x,
int n, int m) {
NumericVector prods (n);
for (int i = 0; i < n; i++) {
int istart = std::max(0, std::min(i - p, m - 1));
int istop = std::min(m, i + q);
NumericVector Arow = A(i, _);
Arow = Arow[Range(istart, istop)];
NumericVector b = x[Range(istart, istop)];
prods[i] = std::inner_product(Arow.begin(), Arow.end(), b.begin(), 0.0);
//NumericVector prodsVec = Arow * b;
//prods[i] = sum(prodsVec);
}
return prods;
}")
bandMatVecMultR = function(l.bandwidth, u.bandwidth, mat, vec = 1:dim(mat)[2]) {
    dim.mat = dim(mat)
    result = bandMatVecMult(p = as.integer(l.bandwidth),
                            q = as.integer(u.bandwidth),
                            A = as.matrix(mat),
                            x = vec,
                            n = as.integer(dim.mat[1]),
                            m = as.integer(dim.mat[2]))
    return(result)
}
I run the code on the same input 100 times to show my issue.
for (i in 1:100) {
    a = bandMatVecMultR(81, 0, y, vec = 1:10)
    print(a)
}
Sometimes I get the correct answer, other times something else. Again, the input stays constant. The issue persists whether I use the inner_product function or calculate the dot product myself:
#a correct output
[1] 1 1 2 2 5 4 3 3 3 3 6 6 5 4 4 4 4 4 4 10 12 8 10 11 9 5 5 5 5 5 11 9
[33] 7 8 9 7 7 6 6 6 17 14 12 12 9 12 7 7 7 7 27 25 23 16 15 9 12 10 18 20 12 14 13 14
[65] 17 15 19 16 13 10 13 10 10 9 9 38 26 25 20 20 19 21 18 17 18 19 10 10 10 10 10 0 0 0 0 0
[97] 0 0 0 0
#an output with random NaN
[1] 1 1 2 2 5 4 3 3 3 3 NaN 6 5 4 NaN 4 4 4 4 10 12 8 10 11
[25] 9 5 5 5 5 5 11 9 7 8 9 7 NaN 6 6 6 NaN 14 12 12 NaN 12 7 7
[49] 7 NaN 27 25 23 16 15 9 12 10 NaN 20 12 14 13 14 17 15 19 16 13 NaN 13 10
[73] 10 9 9 38 26 25 20 20 19 21 18 17 18 19 10 10 10 10 10 0 0 0 0 NaN
[97] 0 0 0 0
#NaN and different formatting
[1] 1.000000e+00 1.000000e+00 2.000000e+00 2.000000e+00 5.000000e+00 4.000000e+00
[7] 3.000000e+00 3.000000e+00 3.000000e+00 3.000000e+00 6.000000e+00 6.000000e+00
[13] 5.000000e+00 4.000000e+00 4.000000e+00 4.000000e+00 4.000000e+00 4.000000e+00
[19] 4.000000e+00 NaN 1.200000e+01 8.000000e+00 1.000000e+01 1.100000e+01
[25] 9.000000e+00 5.000000e+00 5.000000e+00 5.000000e+00 5.000000e+00 5.000000e+00
[31] 1.100000e+01 9.000000e+00 7.000000e+00 8.000000e+00 9.000000e+00 7.000000e+00
[37] 7.000000e+00 6.000000e+00 6.000000e+00 6.000000e+00 1.700000e+01 1.400000e+01
[43] 1.200000e+01 1.200000e+01 9.000000e+00 1.200000e+01 7.000000e+00 7.000000e+00
[49] 7.000000e+00 7.000000e+00 2.700000e+01 2.500000e+01 2.300000e+01 1.600000e+01
[55] 1.500000e+01 9.000000e+00 1.200000e+01 1.000000e+01 1.800000e+01 2.000000e+01
[61] 1.200000e+01 1.400000e+01 1.300000e+01 1.400000e+01 1.700000e+01 1.500000e+01
[67] 1.900000e+01 1.600000e+01 1.300000e+01 1.000000e+01 1.300000e+01 1.000000e+01
[73] 1.000000e+01 9.000000e+00 9.000000e+00 3.800000e+01 2.600000e+01 2.500000e+01
[79] 2.000000e+01 2.000000e+01 1.900000e+01 2.100000e+01 1.800000e+01 1.700000e+01
[85] 1.800000e+01 1.900000e+01 1.000000e+01 1.000000e+01 1.000000e+01 1.000000e+01
[91] 1.000000e+01 0.000000e+00 0.000000e+00 1.671951e-131 0.000000e+00 0.000000e+00
[97] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
I get this error for larger matrices (1000 x 10), but I haven't been able to reproduce it for smaller matrices (10 x 10). I have custom code that turns a matrix into banded form, and my regular computations work fine on it (y %*% vec works fine). In addition, I have a version of this function that multiplies a matrix by vector without the banded assumption, and it does not produce this error.
cppFunction("NumericVector matVecMult(NumericMatrix A, NumericVector x,
int n, int m) {
NumericVector prods(n);
for (int i = 0; i < n; i++) {
NumericVector Arow = A(i, _);
prods[i] = std::inner_product(Arow.begin(), Arow.end(), x.begin(), 0.0);
}
return prods;
}")
matVecMultR = function(mat, vec = 1:dim(mat)[2]) {
    dim.mat = dim(mat)
    result = matVecMult(A = as.matrix(mat),
                        x = vec,
                        n = as.integer(dim.mat[1]),
                        m = as.integer(dim.mat[2]))
    return(result)
}
I just started using Rcpp yesterday and had never written C++ code before then. What is causing the issue, and how can I fix it? Thanks in advance.
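A likely cause worth checking: Rcpp::Range(first, last) is inclusive of last, so istop = std::min(m, i + q) can evaluate to m, and the slices Arow[Range(istart, istop)] and x[Range(istart, istop)] then read one element past the end of length-m vectors. Reading uninitialized memory would explain the intermittent NaN and denormal values such as 1.671951e-131. A sketch with the upper bound clamped to m - 1:
cppFunction("NumericVector bandMatVecMult(int p, int q, NumericMatrix A, NumericVector x,
                                          int n, int m) {
    NumericVector prods(n);
    for (int i = 0; i < n; i++) {
        int istart = std::max(0, std::min(i - p, m - 1));
        int istop = std::min(m - 1, i + q);  // was std::min(m, i + q): Range is inclusive, so m is out of bounds
        NumericVector Arow = A(i, _);
        Arow = Arow[Range(istart, istop)];
        NumericVector b = x[Range(istart, istop)];
        prods[i] = std::inner_product(Arow.begin(), Arow.end(), b.begin(), 0.0);
    }
    return prods;
}")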

Pandas grouped differences with variable lags

I have a pandas data frame with three variables. The first is a grouping variable, the second a within-group "scenario", and the third an outcome. I would like to calculate the within-group difference between the null condition, scenario zero, and the other scenarios within the group. The number of scenarios varies between groups. My data looks like:
ipdb> aDf
FieldId Scenario TN_load
0 0 0 134.922952
1 0 1 111.787326
2 0 2 104.805951
3 1 0 17.743467
4 1 1 13.411849
5 1 2 13.944552
6 1 3 17.499152
7 1 4 17.640090
8 1 5 14.220673
9 1 6 14.912306
10 1 7 17.233862
11 1 8 13.313953
12 1 9 17.967438
13 1 10 14.051882
14 1 11 16.307317
15 1 12 12.506358
16 1 13 16.266233
17 1 14 12.913150
18 1 15 18.149811
19 1 16 12.337736
20 1 17 12.008868
21 1 18 13.434605
22 2 0 454.857959
23 2 1 414.372215
24 2 2 478.371387
25 2 3 385.973388
26 2 4 487.293966
27 2 5 481.280175
28 2 6 403.285123
29 3 0 30.718375
... ... ...
29173 4997 3 53.193992
29174 4997 4 45.800968
I will also have to write functions to get percentage differences etc. but this has me stumped. Any help greatly appreciated.
You can get the difference from scenario 0 within groups using groupby and transform, like:
df['TN_load_0'] = df['TN_load'].groupby(df['FieldId']).transform(lambda x: x - x.iloc[0])
df
FieldId Scenario TN_load TN_load_0
0 0 0 134.922952 0.000000
1 0 1 111.787326 -23.135626
2 0 2 104.805951 -30.117001
3 1 0 17.743467 0.000000
4 1 1 13.411849 -4.331618
5 1 2 13.944552 -3.798915
6 1 3 17.499152 -0.244315
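One caveat: x.iloc[0] assumes the Scenario == 0 row comes first in every group, as it does in the sorted data above. A sketch that looks the baseline up explicitly, so row order does not matter (same column names as the question):
import pandas as pd

# baseline TN_load per FieldId, taken from the Scenario == 0 rows
base = df.loc[df['Scenario'] == 0].set_index('FieldId')['TN_load']
df['TN_load_0'] = df['TN_load'] - df['FieldId'].map(base)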

How to add dummy row based on one column in pandas dataframe?

I'm working with pandas. Basically I have two dataframes, and the number of rows differs between them:
df
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14 0.50415 0.09845 52.83236 4398.007473
5 5520.50 1 0.06148 0.12556 8.21685 5520.484742
df1
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28616 0.07521 22.91064 4050.327388
1 4208.98 6 0.48781 0.08573 44.51609 4208.990029
2 4374.94 9 0.71548 0.11437 87.10152 4374.944513
3 4379.74 10 0.31338 0.09098 30.34791 4379.778009
4 4398.01 15 0.49950 0.08612 45.78707 4398.020367
5 4502.21 9 0.56362 0.10114 60.67868 4502.223123
6 4508.28 3 0.69554 0.11600 85.88428 4508.291777
7 4512.99 2 0.20486 0.08891 19.38745 4512.999332
8 5520.50 1 0.06148 0.12556 8.21685 5520.484742
So there are some rows in df1 that are not in df. I want to add those rows to the dataframe and reset the index accordingly. Previously I was just removing the extra rows to keep the two dataframes equal, but now I just want to add an empty row wherever a wave value isn't there.
The desired result should look like this,
wave num stlines fwhm EWs MeasredWave
0 4050.32 3 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14 0.50415 0.09845 52.83236 4398.007473
5 4502.21 0 0 0 0 0
6 4508.28 0 0 0 0 0
7 4512.99 0 0 0 0 0
8 5520.50 1 0.06148 0.12556 8.21685 5520.484742
How can I get this?
IIUC, you can use DataFrame.loc to update the values of df1 where wave doesn't exist in df:
df1.loc[~df1.wave.isin(df.wave), 'num':] = 0
Then use DataFrame.combine_first to make sure that the values in df take precedence:
df_out = df.set_index('wave').combine_first(df1.set_index('wave')).reset_index()
[out]
print(df_out)
wave num stlines fwhm EWs MeasredWave
0 4050.32 3.0 0.28269 0.07365 22.16080 4050.311360
1 4208.98 5.0 0.48122 0.08765 44.90035 4208.972962
2 4374.94 9.0 0.71483 0.11429 86.96497 4374.927110
3 4379.74 9.0 0.31404 0.09107 30.44271 4379.760601
4 4398.01 14.0 0.50415 0.09845 52.83236 4398.007473
5 4502.21 0.0 0.00000 0.00000 0.00000 0.000000
6 4508.28 0.0 0.00000 0.00000 0.00000 0.000000
7 4512.99 0.0 0.00000 0.00000 0.00000 0.000000
8 5520.50 1.0 0.06148 0.12556 8.21685 5520.484742
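An alternative sketch, assuming as here that df1's wave values are a sorted superset of df's: reindex df on df1's waves and fill the new rows with zeros.
df_out = (df.set_index('wave')
            .reindex(df1['wave'].unique(), fill_value=0)
            .rename_axis('wave')
            .reset_index())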

Compute multiple z-scores by two groups with dplyr mutate and dapply/lapply

I'm trying to compute Z-scores for multiple variables by two groups.
Here's an example:
data = mtcars
The variables I want Z-scores for:
vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
Computing the z-score for one variable (working):
mtcars %>%
  group_by(am, vs) %>%
  mutate(z_mpg = (mpg - mean(mpg)) / sd(mpg))
The problem is that I can't get dapply or lapply working on the previous code to run all of the "vars" variables through, so that I'd get all the Z-scores at once.
If you have an idea of how to do this by normalising the data (mean 0, SD 1) while taking the groups into account, instead of z-scoring, that would also help me.
Thanks!
You would use mutate_at and use funs to define your z-score function. In this case it uses . to indicate the column being mutated.
mtcars %>%
  group_by(am, vs) %>%
  mutate_at(.cols = vars, funs(z = (. - mean(.)) / sd(.)))
Source: local data frame [32 x 17]
Groups: am, vs [4]
mpg cyl disp hp drat wt qsec vs am gear carb mpg_z_ disp_z_ hp_z_ drat_z_ wt_z_ qsec_z_
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 0.3118089 -0.4852978 -0.7168218 -0.1024154 -0.48795905 0.60787578
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 0.3118089 -0.4852978 -0.7168218 -0.1024154 0.03595488 1.12105734
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 -1.1710339 0.9679756 0.5147599 -0.7890520 0.66286051 -0.09519147
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 0.2659345 1.6870444 0.3753676 -1.0547191 0.05956492 -0.36000354
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 1.3156017 0.0331832 -0.5745432 0.1266602 -0.86434641 -0.15281032
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 -1.0695190 1.0153670 0.1364973 -1.7435153 0.76407410 0.17268463
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 -0.2703291 0.0331832 1.5237884 0.3872184 -0.69514319 -1.62477907
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 1.4799831 -0.5783405 -1.9177872 0.2582986 -0.01232378 0.02243925
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 0.8324905 -0.6984282 -0.3412433 0.7533708 -0.12734568 2.00294653
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 -0.6243679 -0.1529447 0.9964304 0.7533708 0.70656315 -1.13854778
# ... with 22 more rows
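For reference, funs() and mutate_at() have since been deprecated; in dplyr 1.0 and later the same computation can be sketched with across():
library(dplyr)

mtcars %>%
  group_by(am, vs) %>%
  mutate(across(all_of(vars), ~ (.x - mean(.x)) / sd(.x), .names = "{.col}_z"))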

Converting dict of dicts into pandas DataFrame - memory issues

I have a data structure that consists of a three-level nested dict that keeps counts of occurrences of a three-part object. I'd like to build a DataFrame out of it with a specific shape, but I can't figure out a way to do it that doesn't involve consuming a lot of working memory, because the table is quite large (several GBs at full extent).
The basic functionality looks like this:
class SparseCubeTable:
    def __init__(self):
        self.table = {}
        self.dim1 = []
        self.dim2 = []
        self.dim3 = []

    def increment(self, dim1, dim2, dim3):
        if dim1 in self.table:
            if dim2 in self.table[dim1]:
                if dim3 in self.table[dim1][dim2]:
                    self.table[dim1][dim2][dim3] += 1
                else:
                    self.dim3.append(dim3)
                    self.table[dim1][dim2][dim3] = 1
            else:
                self.dim2.append(dim2)
                self.dim3.append(dim3)
                self.table[dim1][dim2] = {dim3: 1}
        else:
            self.dim1.append(dim1)
            self.dim2.append(dim2)
            self.dim3.append(dim3)
            self.table[dim1] = {dim2: {dim3: 1}}
This was constructed to make summing over keys easier, among other things. A SparseCubeTable is used like this:
In [23]: example = SparseCubeTable()
In [24]: example.increment("thing1", "thing2", "thing3")
In [25]: example.increment("thing1", "thing2", "thing3")
In [26]: example.increment("thing4", "thing5", "thing6")
In [27]: example.increment("thing1", "thing3", "thing5")
And you can get the data like this:
In [29]: example.table['thing1']['thing2']['thing3']
Out[29]: 2
The sort of DataFrame I want looks like this:
1 2 3 4
thing1 thing2 thing3 2
thing1 thing3 thing5 1
thing4 thing5 thing6 1
The DataFrame is going to be saved as an HDF5 db, with columns 1-3 indexed and statistical transformations on column 4 (which require the whole table to be in memory temporarily).
The problem is that the pandas.DataFrame.from_dict function builds a whole other sort of structure, with the keys used as row labels, as far as I understand it. However, using from_records forces me to copy the whole data structure out into a list, meaning that I then have double the memory size to worry about.
I tried implementing the solution in:
Create a pandas DataFrame from generator?
but in 0.12.0 what it ends up doing is first building a giant list of strings, which is even worse. I assume writing the structure out to a csv and reading it back in would also be terrible on memory.
Is there a better way of doing this? Or should I just try to squeeze memory even further in the SparseCubeTable somehow? It seems so wasteful to have to build an intermediate list data structure to use from_records.
Here is code for an efficient solution.
First, create some data that looks like yours: a list of 1000 3-tuples.
In [1]: import random, pandas as pd
In [2]: tags = ['thing{0}'.format(i) for i in range(100)]
In [3]: data = [(random.choice(tags), random.choice(tags), random.choice(tags)) for i in range(1000)]
Our writing function makes sure that the index we write is globally unique (not strictly necessary, but since the index is actually written, it's 'nicer'):
In [4]: def write(store, c):
...:     df = pd.DataFrame(c, columns=['dim1', 'dim2', 'dim3'])
...:     try:
...:         nrows = store.get_storer('df').nrows
...:     except:
...:         nrows = 0
...:     df.index += nrows
...:     store.append('df', df, data_columns=True)
...:     return []
...:
In [5]: collector = []
In [6]: store = pd.HDFStore('data.h5',mode='w')
Iterate through your data (or a stream, or whatever) and write it.
In [7]: for i, d in enumerate(data):
...:     collector.append(d)
...:     if i % 100 == 0 and i:
...:         collector = write(store, collector)
...:
In [8]: write(store,collector)
Out[8]: []
The store:
In [9]: store
Out[9]:
<class 'pandas.io.pytables.HDFStore'>
File path: data.h5
/df frame_table (typ->appendable,nrows->1000,ncols->3,indexers->[index],dc->[dim1,dim2,dim3])
In [10]: store.select('df')
Out[10]:
dim1 dim2 dim3
0 thing28 thing87 thing29
1 thing62 thing70 thing50
2 thing64 thing12 thing98
3 thing33 thing98 thing46
4 thing46 thing5 thing76
5 thing2 thing9 thing21
6 thing1 thing63 thing68
7 thing42 thing30 thing45
8 thing56 thing71 thing77
9 thing99 thing10 thing91
10 thing40 thing9 thing10
11 thing70 thing54 thing59
12 thing94 thing65 thing3
13 thing93 thing24 thing25
14 thing95 thing94 thing86
15 thing41 thing55 thing3
16 thing88 thing10 thing47
17 thing89 thing58 thing33
18 thing16 thing66 thing55
19 thing68 thing20 thing99
20 thing34 thing71 thing28
21 thing67 thing87 thing97
22 thing77 thing74 thing6
23 thing63 thing41 thing30
24 thing14 thing62 thing66
25 thing20 thing36 thing67
26 thing33 thing19 thing58
27 thing0 thing71 thing24
28 thing1 thing48 thing42
29 thing18 thing12 thing4
30 thing85 thing97 thing20
31 thing73 thing71 thing70
32 thing91 thing43 thing48
33 thing45 thing6 thing87
34 thing0 thing28 thing8
35 thing56 thing38 thing61
36 thing39 thing92 thing35
37 thing69 thing26 thing22
38 thing16 thing16 thing79
39 thing4 thing16 thing12
40 thing81 thing79 thing1
41 thing77 thing90 thing83
42 thing53 thing17 thing89
43 thing53 thing15 thing37
44 thing25 thing7 thing20
45 thing44 thing14 thing25
46 thing62 thing84 thing23
47 thing83 thing50 thing60
48 thing68 thing64 thing24
49 thing73 thing53 thing43
50 thing86 thing67 thing31
51 thing75 thing63 thing82
52 thing8 thing10 thing90
53 thing34 thing23 thing12
54 thing66 thing97 thing26
55 thing66 thing53 thing27
56 thing79 thing22 thing37
57 thing43 thing82 thing66
58 thing87 thing53 thing92
59 thing33 thing71 thing97
... ... ...
[1000 rows x 3 columns]
In [11]: store.close()
Then you can do interesting things. If you are not reading the entire set in, you may want to chunk this (which is a bit more involved if you are counting things; a chunked sketch follows the next output).
In [56]: pd.read_hdf('data.h5','df').apply(lambda x: x.value_counts())
Out[56]:
dim1 dim2 dim3
thing0 12 6 8
thing1 14 7 8
thing10 10 10 7
thing11 8 10 14
thing12 11 14 11
thing13 11 12 7
thing14 8 14 3
thing15 12 11 11
thing16 7 10 11
thing17 16 9 13
thing18 13 8 10
thing19 11 7 8
thing2 9 5 17
thing20 6 7 11
thing21 7 8 8
thing22 4 17 14
thing23 14 11 7
thing24 10 5 14
thing25 11 11 12
thing26 13 10 15
thing27 12 15 16
thing28 11 10 8
thing29 7 7 8
thing3 11 14 14
thing30 11 16 8
thing31 7 6 12
thing32 8 12 9
thing33 13 12 12
thing34 12 8 5
thing35 6 10 8
thing36 6 9 13
thing37 8 10 12
thing38 7 10 4
thing39 14 11 7
thing4 9 7 10
thing40 12 8 9
thing41 8 16 11
thing42 9 11 13
thing43 8 6 13
thing44 9 13 11
thing45 7 13 7
thing46 12 8 13
thing47 9 10 9
thing48 8 9 9
thing49 4 8 7
thing5 13 7 7
thing50 14 12 9
thing51 5 7 11
thing52 9 11 12
thing53 9 15 15
thing54 7 9 13
thing55 6 10 10
thing56 12 11 11
thing57 12 9 11
thing58 12 12 10
thing59 6 13 10
thing6 8 5 7
thing60 12 9 6
thing61 5 9 9
thing62 8 10 8
... ... ...
[100 rows x 3 columns]
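For the chunked variant mentioned above, a sketch (chunksize is an option of HDFStore.select; the per-chunk counts are combined with add):
import pandas as pd

store = pd.HDFStore('data.h5')
counts = None
for chunk in store.select('df', chunksize=100000):
    c = chunk.apply(lambda x: x.value_counts())
    counts = c if counts is None else counts.add(c, fill_value=0)
store.close()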
You can then do a 'groupby' like this:
In [69]: store = pd.HDFStore('data.h5')
In [61]: dim1 = pd.Index(store.select_column('df','dim1').unique())
In [66]: store.close()
In [67]: groups = dim1[0:10]
In [68]: groups
Out[68]: Index([u'thing28', u'thing62', u'thing64', u'thing33', u'thing46', u'thing2', u'thing1', u'thing42', u'thing56', u'thing99'], dtype='object')
In [70]: pd.read_hdf('data.h5','df',where='dim1=groups').apply(lambda x: x.value_counts())
Out[70]:
dim1 dim2 dim3
thing1 14 2 1
thing10 NaN 1 1
thing11 NaN 1 2
thing12 NaN 5 NaN
thing13 NaN 1 NaN
thing14 NaN 1 1
thing15 NaN 1 1
thing16 NaN 1 3
thing17 NaN NaN 2
thing18 NaN 1 1
thing19 NaN 1 2
thing2 9 1 1
thing20 NaN 2 NaN
thing21 NaN NaN 1
thing22 NaN 2 2
thing23 NaN 2 3
thing24 NaN 2 1
thing25 NaN 3 2
thing26 NaN 2 2
thing27 NaN 3 1
thing28 11 NaN NaN
thing29 NaN 1 2
thing30 NaN 2 NaN
thing31 NaN 1 1
thing32 NaN 1 1
thing33 13 1 2
thing34 NaN 1 NaN
thing35 NaN 1 NaN
thing36 NaN 1 1
thing37 NaN 1 2
thing38 NaN 3 NaN
thing39 NaN 3 1
thing4 NaN 2 NaN
thing41 NaN NaN 1
thing42 9 1 1
thing43 NaN NaN 1
thing44 NaN 1 2
thing45 NaN NaN 2
thing46 12 NaN 1
thing47 NaN 1 1
thing48 NaN 1 NaN
thing49 NaN 1 NaN
thing5 NaN 2 2
thing50 NaN NaN 3
thing51 NaN 2 2
thing52 NaN 1 3
thing53 NaN 2 4
thing55 NaN NaN 2
thing56 12 1 1
thing57 NaN NaN 3
thing58 NaN 1 2
thing6 NaN NaN 1
thing60 NaN 1 1
thing61 NaN 1 4
thing62 8 2 1
thing63 NaN 1 1
thing64 15 NaN 1
thing66 NaN 1 2
thing67 NaN 2 NaN
thing68 NaN 1 1
... ... ...
[90 rows x 3 columns]