I have a data frame in the following form
name v1 v2 v3
x 1 4 7
y 2 5 8
z 3 6 9
I want to multiply each value in the middle two columns by the value in the final column; the output would be:
name v1 v2 v3
x 7 28 7
y 16 40 8
z 27 54 9
My current attempt gives an error: 'Index' object has no attribute 'apply'
df[df.columns[1:-2]] = df.columns[1:-2].apply(lambda x: x * df.columns[-1])
You can use iloc to select by position, combined with mul:
print (df.iloc[:, 1:-1])
v1 v2
0 1 4
1 2 5
2 3 6
df.iloc[:, 1:-1] = df.iloc[:, 1:-1].mul(df.iloc[:, -1], axis=0)
print (df)
name v1 v2 v3
0 x 7 28 7
1 y 16 40 8
2 z 27 54 9
Solution selecting columns by name:
df[['v1','v2']] = df[['v1','v2']].mul(df['v3'], axis=0)
print (df)
name v1 v2 v3
0 x 7 28 7
1 y 16 40 8
2 z 27 54 9
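For completeness, here is a minimal self-contained sketch of the iloc approach, with the example data typed in directly:
import pandas as pd
df = pd.DataFrame({'name': ['x', 'y', 'z'],
                   'v1': [1, 2, 3],
                   'v2': [4, 5, 6],
                   'v3': [7, 8, 9]},
                  columns=['name', 'v1', 'v2', 'v3'])
# multiply every column except the first and last by the last column
df.iloc[:, 1:-1] = df.iloc[:, 1:-1].mul(df.iloc[:, -1], axis=0)
print(df)
#   name  v1  v2  v3
# 0    x   7  28   7
# 1    y  16  40   8
# 2    z  27  54   9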
I have a dataset that looks like this but with several more binary outcome variables:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long ID byte(Region Type Tier Secure Offshore Highland)
120034 12 1 2 1 0 1
120035 12 1 2 1 0 1
120036 12 1 2 1 0 1
120037 12 1 2 1 0 1
120038 41 1 2 1 0 0
120039 41 2 2 1 1 0
120040 41 2 1 0 1 0
120041 41 2 1 0 1 0
120042 41 2 1 0 1 0
120043 41 2 1 0 0 .
120044 65 2 1 0 0 .
120045 65 3 1 0 0 0
120046 65 3 1 1 0 0
120047 65 3 2 1 1 0
120048 65 3 2 1 0 0
120049 65 3 2 . 1 1
120050 25 3 2 . 1 1
120051 25 5 2 . 1 1
120052 25 5 1 . 0 1
120053 25 5 2 . 0 .
120054 25 5 2 0 0 .
120055 25 5 1 0 . 0
120056 25 5 1 0 . 0
120057 95 7 1 0 1 0
120058 95 7 1 0 1 0
120059 95 7 1 1 1 0
120060 95 7 2 1 0 1
120061 95 7 2 1 0 1
120062 59 7 2 1 0 1
120063 95 8 2 0 . 1
120064 59 8 1 0 . 1
120065 59 8 1 0 . 0
120066 59 8 1 1 . 0
120067 59 8 1 1 1 0
120068 59 8 2 1 1 0
120069 40 9 2 1 1 1
120070 40 9 2 1 0 1
120071 40 9 2 1 0 1
120072 40 9 1 0 0 1
end
I am creating a table with the community-contributed command tabout:
foreach v of var Secure Offshore Highland{
tabout Region Type Tier `v' using `v'.docx ///
, replace ///
style(docx) font(bold) c(freq row) ///
f(1) layout(cb) landscape nlab(Obs) paper(A4)
}
It includes row frequencies, percentages, and the totals.
However, I do not need all of this information, so I modified my code as follows:
foreach v of var Secure Offshore Highland{
tabout Region Type Tier `v' using `v'.docx ///
, replace ///
style(docx) font(bold) c(freq row) ///
f(1) layout(cb) h3(nil) h2(nil) dropc(2 3 4 5 7) landscape nlab(Obs) paper(A4)
}
This produces what I need, but both versions of my code create three individual tables, one for each outcome variable. I then have to manually combine them into one table, keeping the left-most column, the % of "1" column, and the right-most column showing the row total.
Can anyone help me out here regarding:
Merging all the tables in one go, keeping the explanatory variable labels in the left-most column and the row total in the right-most column.
Instead of deleting every column except the % of "1"s, keeping only the desired column in the first place. Deleting columns seems so crude and dangerous.
Can I get this same output in Excel through "putexcel"? I tried following the wonderfully written blog by Chuck Huber, but I cannot figure out the "merging" part.
I got this far thanks to lots and lots of studying, especially Ian Watson's "User Guide for tabout Version 3" and Nicholas Cox's "How to face lists with fortitude".
Cross-posted on Statalist.
You cannot do this readily with tabout -- custom tables require custom programming.
My advice is to create a matrix with whatever values you need and then use the (also) community-contributed command esttab to tabulate and export everything.
That said, what you want requires a lot of work but here is a simplified example based on your data:
matrix N = J(1, 2, .)
local i 0
foreach v in Region Type Tier {
local i = `i' + 1
tabulate `v' Secure, matcell(A`i')
matrix arowsum = J(1, rowsof(A`i'), 1) * A`i'
matrix A`i' = A`i' \ arowsum
if `i' > 1 local N \ N
matrix m1a = (nullmat(m1a) `N' \ A`i')
}
local i 0
foreach v in Region Type Tier {
local i = `i' + 1
tabulate `v' Offshore, matcell(B`i')
matrix browsum = J(1, rowsof(B`i'), 1) * B`i'
matrix B`i' = B`i' \ browsum
if `i' > 1 local N \ N
matrix m2a = (nullmat(m2a) `N' \ B`i')
}
local i 0
foreach v in Region Type Tier {
local i = `i' + 1
tabulate `v' Highland, matcell(C`i')
matrix crowsum = J(1, rowsof(C`i'), 1) * C`i'
matrix C`i' = C`i' \ crowsum
if `i' > 1 local N \ N
matrix m3a = (nullmat(m3a) `N' \ C`i')
}
matrix m1b = m1a * J(colsof(m1a), 1, 1)
matrix m2b = m2a * J(colsof(m2a), 1, 1)
matrix m3b = m3a * J(colsof(m3a), 1, 1)
matrix M1 = m1a, m1b
matrix M2 = m2a, m2b
matrix M3 = m3a, m3b
matrix K = J(1, 3, .)
matrix M = M1 \ K \ M2 \ K \ M3
You can then use esttab to export the results in Excel or Word:
esttab matrix(M)
---------------------------------------------------
M
c1 c2 c1
---------------------------------------------------
r1 0 4 4
r2 3 0 3
r3 1 3 4
r4 4 2 6
r5 2 4 6
r6 2 3 5
r7 3 3 6
r1 15 19 34
r1 . . .
r1 0 5 5
r2 5 1 6
r3 1 3 4
r4 3 0 3
r5 2 4 6
r6 3 3 6
r7 1 3 4
r1 15 19 34
r1 . . .
r1 13 4 17
r2 2 15 17
r1 15 19 34
r1 . . .
r1 . . .
r1 4 0 4
r2 3 2 5
r3 3 1 4
r4 2 4 6
r5 1 2 3
r6 4 2 6
r7 2 3 5
r1 19 14 33
r1 . . .
r1 5 0 5
r2 2 4 6
r3 3 3 6
r4 3 1 4
r5 3 3 6
r6 0 2 2
r7 3 1 4
r1 19 14 33
r1 . . .
r1 6 7 13
r2 13 7 20
r1 19 14 33
r1 . . .
r1 . . .
r1 0 4 4
r2 2 3 5
r3 0 4 4
r4 5 0 5
r5 4 2 6
r6 4 1 5
r7 3 3 6
r1 18 17 35
r1 . . .
r1 1 4 5
r2 4 0 4
r3 4 2 6
r4 2 2 4
r5 3 3 6
r6 4 2 6
r7 0 4 4
r1 18 17 35
r1 . . .
r1 13 3 16
r2 5 14 19
r1 18 17 35
---------------------------------------------------
You will have to generate the rest of the elements you want separately (including column and row names, etc.), but the idea is the same. You will also have to play with the options in esttab to fine-tune the desired final outcome.
Note that the above can be written more efficiently, but I have kept everything separate in this answer so you can understand it.
EDIT:
If you are working with matrices as above you can also use putexcel easily:
putexcel A1 = matrix(M)
I have two distance matrices with overlapping variable names.
dfA:
Start A1 A2 A3 A4 … A150
Location
A 12 4 12 2 9
B 5 2 19 4 3
C 1 4 8 7 12
dfB:
A B C
X 4 12 32
Y 1 6 12
Z 2 8,5 11
So from start points A1, A2, etc., via A, B, and C, there are paths to X, Y, and Z.
I would like to find the shortest path for an item, for example the combination A1 -> Z. I programmed this by loading CSVs with the distance matrices and unstacking them, then looping through the possible combinations with df.iterrows() and two for loops to see which is smallest for the combination A1 -> Z.
But since I have to do this for around 30000 items, it takes way too long.
Anybody know how to do this in a vectorized way?
I added D so that the axis lengths will be different (dfB won't be a square matrix), just for my convenience (it works with square matrices too).
import pandas as pd
import numpy as np

# read both matrices; the files use "," as the decimal separator
df_a = pd.read_csv('dfA.csv', delim_whitespace=True, index_col=0, decimal=",")
df_b = pd.read_csv('dfB.csv', delim_whitespace=True, index_col=0, decimal=",")
mat_a = df_a.values                       # shape: (via, start)
mat_b = df_b.values                       # shape: (destination, via)

# align both matrices on a common (via, start, destination) grid
mat_a2 = np.expand_dims(mat_a, axis=2)    # (via, start, 1)
mat_b2 = np.expand_dims(mat_b.T, axis=1)  # (via, 1, destination)
mat_a3 = np.tile(mat_a2, (1, 1, mat_b.shape[0]))
mat_b3 = np.tile(mat_b2, (1, mat_a.shape[1], 1))

# tot[i, j, k] = dist(start j -> via i) + dist(via i -> destination k)
tot = mat_a3 + mat_b3

# index of the best intermediate point for every (destination, start) pair
ind = np.argmin(tot, axis=0).T
df_c = pd.DataFrame(df_b.columns.values[ind], columns=df_a.columns, index=df_b.index)
print(df_c)
dfA:
Start_Location A1 A2 A3 A4 A150
A 12 4 12 2 9
B 5 2 19 4 3
C 1 4 8 7 12
D 5 2 9 11 4
dfB:
A B C D
X 4 12 32 11,4
Y 1 6 2 9,3
Z 2 8,5 11 1,4
dfC:
A1 A2 A3 A4 A150
X A A A A A
Y C A C A B
Z D D D A D
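If you also want the shortest total distance itself, not just the best intermediate point, the same tot array can be reduced with np.min instead of np.argmin; a small sketch building on the code above:
min_dist = np.min(tot, axis=0).T
df_dist = pd.DataFrame(min_dist, columns=df_a.columns, index=df_b.index)
print(df_dist)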
I want to create a pivot table from a pandas dataframe
using dataframe.pivot()
and include not only dataframe columns but also the data within the dataframe index.
Couldn't find any docs that show how to do that.
Any tips?
Use reset_index to make the index a column:
In [45]: df = pd.DataFrame({'y': [0, 1, 2, 3, 4, 4], 'x': [1, 2, 2, 3, 1, 3]}, index=np.arange(6)*10)
In [46]: df
Out[46]:
x y
0 1 0
10 2 1
20 2 2
30 3 3
40 1 4
50 3 4
In [47]: df.reset_index()
Out[47]:
index x y
0 0 1 0
1 10 2 1
2 20 2 2
3 30 3 3
4 40 1 4
5 50 3 4
Then pivot picks up the former index column as the values:
In [48]: df.reset_index().pivot(index='y', columns='x')
Out[48]:
index
x 1 2 3
y
0 0 NaN NaN
1 NaN 10 NaN
2 NaN 20 NaN
3 NaN NaN 30
4 40 NaN 50
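If you only want the former index as the values (rather than every remaining column), you can also pass it explicitly, which should drop the extra 'index' level from the columns:
df.reset_index().pivot(index='y', columns='x', values='index')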
I'm wondering how I can speed up a merge of two dataframes. One of the dataframes has time stamped data points (value col).
import pandas as pd
import numpy as np
data = pd.DataFrame({'time':np.sort(np.random.uniform(0,100,size=50)),
'value':np.random.uniform(-1,1,size=50)})
The other has time interval information (start_time, end_time, and associated interval_id).
intervals = pd.DataFrame({'interval_id':np.arange(9),
'start_time':np.random.uniform(0,5,size=9) + np.arange(0,90,10),
'end_time':np.random.uniform(5,10,size=9) + np.arange(0,90,10)})
I'd like to merge these two dataframes more efficiently than the for loop below:
data['interval_id'] = np.nan
for index, ser in intervals.iterrows():
    in_interval = (data['time'] >= ser['start_time']) & \
                  (data['time'] <= ser['end_time'])
    data['interval_id'][in_interval] = ser['interval_id']
result = data.merge(intervals, how='outer').sort('time').reset_index(drop=True)
I keep imagining I'll be able to use pandas time series functionality, like a date range or TimeGrouper, but I have yet to figure out anything more pythonic (pandas-y?) than the above.
Example result:
time value interval_id start_time end_time
0 0.575976 0.022727 NaN NaN NaN
1 4.607545 0.222568 0 3.618715 8.294847
2 5.179350 0.438052 0 3.618715 8.294847
3 11.069956 0.641269 1 10.301728 19.870283
4 12.387854 0.344192 1 10.301728 19.870283
5 18.889691 0.582946 1 10.301728 19.870283
6 20.850469 -0.027436 NaN NaN NaN
7 23.199618 0.731316 2 21.488868 28.968338
8 26.631284 0.570647 2 21.488868 28.968338
9 26.996397 0.597035 2 21.488868 28.968338
10 28.601867 -0.131712 2 21.488868 28.968338
11 28.660986 0.710856 2 21.488868 28.968338
12 28.875395 -0.355208 2 21.488868 28.968338
13 28.959320 -0.430759 2 21.488868 28.968338
14 29.702800 -0.554742 NaN NaN NaN
Any suggestions from time series-savvy people out there would be greatly appreciated.
Update, after Jeff's answer:
The main problem is that interval_id has no relation to any regular time interval (e.g., intervals are not always approximately 10 seconds). One interval could be 10 seconds, the next could be 2 seconds, and the next could be 100 seconds, so I can't use any regular rounding scheme as Jeff proposed. Unfortunately, my minimal example above does not make that clear.
You could use np.searchsorted to find the indices representing where each value in data['time'] would fit within intervals['start_time']. Then you could call np.searchsorted again to find the indices representing where each value in data['time'] would fit within intervals['end_time']. Note that using np.searchsorted relies on intervals['start_time'] and intervals['end_time'] being in sorted order.
For each location where these two indices are equal, data['time'] falls between intervals['start_time'] and intervals['end_time']. Note that this relies on the intervals being disjoint.
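To make the index logic concrete, here is a tiny hand-checked illustration with made-up numbers (not taken from the question's data):
import numpy as np
starts = np.array([0, 10, 20])
ends   = np.array([5, 15, 25])
times  = np.array([3, 7, 12])
start_idx = np.searchsorted(starts, times) - 1   # [0, 0, 1]
end_idx   = np.searchsorted(ends, times)         # [0, 1, 1]
mask      = (start_idx == end_idx)               # [True, False, True]
# time 3 falls in interval 0, time 7 falls in no interval, time 12 falls in interval 1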
Using searchsorted in this way is about 5 times faster than using the for-loop:
import pandas as pd
import numpy as np
np.random.seed(1)
data = pd.DataFrame({'time':np.sort(np.random.uniform(0,100,size=50)),
'value':np.random.uniform(-1,1,size=50)})
intervals = pd.DataFrame(
{'interval_id':np.arange(9),
'start_time':np.random.uniform(0,5,size=9) + np.arange(0,90,10),
'end_time':np.random.uniform(5,10,size=9) + np.arange(0,90,10)})
def using_loop():
    data['interval_id'] = np.nan
    for index, ser in intervals.iterrows():
        in_interval = (data['time'] >= ser['start_time']) & \
                      (data['time'] <= ser['end_time'])
        data['interval_id'][in_interval] = ser['interval_id']
    result = data.merge(intervals, how='outer').sort('time').reset_index(drop=True)
    return result
def using_searchsorted():
    start_idx = np.searchsorted(intervals['start_time'].values, data['time'].values) - 1
    end_idx = np.searchsorted(intervals['end_time'].values, data['time'].values)
    mask = (start_idx == end_idx)
    result = data.copy()
    result['interval_id'] = result['start_time'] = result['end_time'] = np.nan
    result['interval_id'][mask] = start_idx[mask]
    result.ix[mask, 'start_time'] = intervals['start_time'][start_idx[mask]].values
    result.ix[mask, 'end_time'] = intervals['end_time'][end_idx[mask]].values
    return result
In [254]: %timeit using_loop()
100 loops, best of 3: 7.74 ms per loop
In [255]: %timeit using_searchsorted()
1000 loops, best of 3: 1.56 ms per loop
In [256]: 7.74/1.56
Out[256]: 4.961538461538462
You may want to specify the intervals of 'time' slightly differently, but this should give you a start.
In [34]: data['on'] = np.round(data['time']/10)
In [35]: data.merge(intervals,left_on=['on'],right_on=['interval_id'],how='outer')
Out[35]:
time value on end_time interval_id start_time
0 1.301658 -0.462594 0 7.630243 0 0.220746
1 2.202654 0.054903 0 7.630243 0 0.220746
2 10.253593 0.329947 1 17.715596 1 10.299464
3 13.803064 -0.601021 1 17.715596 1 10.299464
4 17.086290 0.484119 2 27.175455 2 24.710704
5 21.797655 0.988212 2 27.175455 2 24.710704
6 26.265165 0.491410 3 37.702968 3 30.670753
7 27.777182 -0.121691 3 37.702968 3 30.670753
8 34.066473 0.659260 3 37.702968 3 30.670753
9 34.786337 -0.230026 3 37.702968 3 30.670753
10 35.343021 0.364505 4 49.489028 4 42.948486
11 35.506895 0.953562 4 49.489028 4 42.948486
12 36.129951 -0.703457 4 49.489028 4 42.948486
13 38.794690 -0.510535 4 49.489028 4 42.948486
14 40.508702 -0.763417 4 49.489028 4 42.948486
15 43.974516 -0.149487 4 49.489028 4 42.948486
16 46.219554 0.893025 5 57.086065 5 53.124795
17 50.206860 0.729106 5 57.086065 5 53.124795
18 50.395082 -0.807557 5 57.086065 5 53.124795
19 50.410783 0.996247 5 57.086065 5 53.124795
20 51.602892 0.144483 5 57.086065 5 53.124795
21 52.006921 -0.979778 5 57.086065 5 53.124795
22 52.682896 -0.593500 5 57.086065 5 53.124795
23 52.836037 0.448370 5 57.086065 5 53.124795
24 53.052130 -0.227245 5 57.086065 5 53.124795
25 57.169775 0.659673 6 65.927106 6 61.590948
26 59.336176 -0.893004 6 65.927106 6 61.590948
27 60.297771 0.897418 6 65.927106 6 61.590948
28 61.151664 0.176229 6 65.927106 6 61.590948
29 61.769023 0.894644 6 65.927106 6 61.590948
30 64.221220 0.893012 6 65.927106 6 61.590948
31 67.907417 -0.859734 7 78.192671 7 72.463468
32 71.460483 -0.271364 7 78.192671 7 72.463468
33 74.514028 0.621174 7 78.192671 7 72.463468
34 75.822643 -0.351684 8 88.820139 8 83.183825
35 84.252778 -0.685043 8 88.820139 8 83.183825
36 84.838361 0.354365 8 88.820139 8 83.183825
37 85.770611 -0.089678 9 NaN NaN NaN
38 85.957559 0.649995 9 NaN NaN NaN
39 86.498339 0.569793 9 NaN NaN NaN
40 91.006735 0.731006 9 NaN NaN NaN
41 91.941862 0.964376 9 NaN NaN NaN
42 94.617522 0.626889 9 NaN NaN NaN
43 95.318288 -0.088918 10 NaN NaN NaN
44 95.595243 0.539685 10 NaN NaN NaN
45 95.818267 -0.989647 10 NaN NaN NaN
46 98.240444 0.931445 10 NaN NaN NaN
47 98.722869 0.442502 10 NaN NaN NaN
48 99.349198 0.585264 10 NaN NaN NaN
49 99.829372 -0.743697 10 NaN NaN NaN
[50 rows x 6 columns]
I have a data structure that consists of a three-level nested dict that keeps counts of occurrences of a three-part object. I'd like to build a DataFrame out of it with a specific shape, but I can't figure out a way to do it that doesn't consume a lot of working memory, because the table is quite large (several GB at full extent).
The basic functionality looks like this:
class SparseCubeTable:
    def __init__(self):
        self.table = {}
        self.dim1 = []
        self.dim2 = []
        self.dim3 = []

    def increment(self, dim1, dim2, dim3):
        if dim1 in self.table:
            if dim2 in self.table[dim1]:
                if dim3 in self.table[dim1][dim2]:
                    self.table[dim1][dim2][dim3] += 1
                else:
                    self.dim3.append(dim3)
                    self.table[dim1][dim2][dim3] = 1
            else:
                self.dim2.append(dim2)
                self.dim3.append(dim3)
                self.table[dim1][dim2] = {dim3:1}
        else:
            self.dim1.append(dim1)
            self.dim2.append(dim2)
            self.dim3.append(dim3)
            self.table[dim1] = {dim2:{dim3:1}}
This was constructed to make summing over keys easier, among other things. A SparseCubeTable is used like this:
In [23]: example = SparseCubeTable()
In [24]: example.increment("thing1", "thing2", "thing3")
In [25]: example.increment("thing1", "thing2", "thing3")
In [26]: example.increment("thing4", "thing5", "thing6")
In [27]: example.increment("thing1", "thing3", "thing5")
And you can get the data like this:
In [29]: example.table['thing1']['thing2']['thing3']
Out[29]: 2
The sort of DataFrame I want looks like this:
1 2 3 4
thing1 thing2 thing3 2
thing1 thing3 thing5 1
thing4 thing5 thing6 1
The DataFrame is going to be saved as an HDF5 db with columns 1-3 indexed and statistical transformations on column 4 (that require the whole table be temporarily in memory).
The problem is that the pandas.DataFrame.from_dict function builds a whole other sort of structure with the keys used as row labels, as far as I understand it. However, trying to use from_records forces me to copy out the whole data structure into a list, meaning that I now have double the memory size to worry about.
I tried implementing the solution in:
Create a pandas DataFrame from generator?
but in 0.12.0 what it ends up doing is first building a giant list of strings which is even worse. I assume writing out the structure to a csv and reading it back in is also going to be terrible on memory.
Is there a better way of doing this? Or should I just try to squeeze memory even further in the SparseCubeTable somehow? It seems so wasteful to have to build an intermediate list data structure to use from_records.
Here is code for an efficient solution.
Create some data that looks like yours. This is a list of 1000 3-tuples:
In [1]: import random
In [2]: tags = [ 'thing{0}'.format(i) for i in xrange(100) ]
In [3]: data = [ (random.choice(tags),random.choice(tags),random.choice(tags)) for i in range(1000) ]
Our writing function makes sure that, when we write, the index is globally unique (it's not actually necessary, but since the index is actually written it's 'nicer'):
In [4]: def write(store,c):
   ...:     df = DataFrame(c,columns=['dim1','dim2','dim3'])
   ...:     try:
   ...:         nrows = store.get_storer('df').nrows
   ...:     except:
   ...:         nrows = 0
   ...:     df.index += nrows
   ...:     store.append('df',df,data_columns=True)
   ...:     return []
   ...:
In [5]: collector = []
In [6]: store = pd.HDFStore('data.h5',mode='w')
Iterate through your data (or from a stream or whatever), and write it.
In [7]: for i, d in enumerate(data):
   ...:     collector.append(d)
   ...:     if i % 100 == 0 and i:
   ...:         collector = write(store,collector)
   ...:
In [8]: write(store,collector)
Out[8]: []
The store
In [9]: store
Out[9]:
<class 'pandas.io.pytables.HDFStore'>
File path: data.h5
/df frame_table (typ->appendable,nrows->1000,ncols->3,indexers->[index],dc->[dim1,dim2,dim3])
In [10]: store.select('df')
Out[10]:
dim1 dim2 dim3
0 thing28 thing87 thing29
1 thing62 thing70 thing50
2 thing64 thing12 thing98
3 thing33 thing98 thing46
4 thing46 thing5 thing76
5 thing2 thing9 thing21
6 thing1 thing63 thing68
7 thing42 thing30 thing45
8 thing56 thing71 thing77
9 thing99 thing10 thing91
10 thing40 thing9 thing10
11 thing70 thing54 thing59
12 thing94 thing65 thing3
13 thing93 thing24 thing25
14 thing95 thing94 thing86
15 thing41 thing55 thing3
16 thing88 thing10 thing47
17 thing89 thing58 thing33
18 thing16 thing66 thing55
19 thing68 thing20 thing99
20 thing34 thing71 thing28
21 thing67 thing87 thing97
22 thing77 thing74 thing6
23 thing63 thing41 thing30
24 thing14 thing62 thing66
25 thing20 thing36 thing67
26 thing33 thing19 thing58
27 thing0 thing71 thing24
28 thing1 thing48 thing42
29 thing18 thing12 thing4
30 thing85 thing97 thing20
31 thing73 thing71 thing70
32 thing91 thing43 thing48
33 thing45 thing6 thing87
34 thing0 thing28 thing8
35 thing56 thing38 thing61
36 thing39 thing92 thing35
37 thing69 thing26 thing22
38 thing16 thing16 thing79
39 thing4 thing16 thing12
40 thing81 thing79 thing1
41 thing77 thing90 thing83
42 thing53 thing17 thing89
43 thing53 thing15 thing37
44 thing25 thing7 thing20
45 thing44 thing14 thing25
46 thing62 thing84 thing23
47 thing83 thing50 thing60
48 thing68 thing64 thing24
49 thing73 thing53 thing43
50 thing86 thing67 thing31
51 thing75 thing63 thing82
52 thing8 thing10 thing90
53 thing34 thing23 thing12
54 thing66 thing97 thing26
55 thing66 thing53 thing27
56 thing79 thing22 thing37
57 thing43 thing82 thing66
58 thing87 thing53 thing92
59 thing33 thing71 thing97
... ... ...
[1000 rows x 3 columns]
In [11]: store.close()
Then you can do interesting things. If you are not reading the entire set in, you may want to chunk this (which is a bit more involved if you are counting things); a chunked sketch follows the output below.
In [56]: pd.read_hdf('data.h5','df').apply(lambda x: x.value_counts())
Out[56]:
dim1 dim2 dim3
thing0 12 6 8
thing1 14 7 8
thing10 10 10 7
thing11 8 10 14
thing12 11 14 11
thing13 11 12 7
thing14 8 14 3
thing15 12 11 11
thing16 7 10 11
thing17 16 9 13
thing18 13 8 10
thing19 11 7 8
thing2 9 5 17
thing20 6 7 11
thing21 7 8 8
thing22 4 17 14
thing23 14 11 7
thing24 10 5 14
thing25 11 11 12
thing26 13 10 15
thing27 12 15 16
thing28 11 10 8
thing29 7 7 8
thing3 11 14 14
thing30 11 16 8
thing31 7 6 12
thing32 8 12 9
thing33 13 12 12
thing34 12 8 5
thing35 6 10 8
thing36 6 9 13
thing37 8 10 12
thing38 7 10 4
thing39 14 11 7
thing4 9 7 10
thing40 12 8 9
thing41 8 16 11
thing42 9 11 13
thing43 8 6 13
thing44 9 13 11
thing45 7 13 7
thing46 12 8 13
thing47 9 10 9
thing48 8 9 9
thing49 4 8 7
thing5 13 7 7
thing50 14 12 9
thing51 5 7 11
thing52 9 11 12
thing53 9 15 15
thing54 7 9 13
thing55 6 10 10
thing56 12 11 11
thing57 12 9 11
thing58 12 12 10
thing59 6 13 10
thing6 8 5 7
thing60 12 9 6
thing61 5 9 9
thing62 8 10 8
... ... ...
[100 rows x 3 columns]
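If the table is too large to read back in one go, the same per-column counts can be accumulated chunk by chunk; a rough sketch (the chunk size is arbitrary), reusing the store created above:
store = pd.HDFStore('data.h5')
counts = None
for chunk in store.select('df', chunksize=250):
    # per-column value counts for this chunk, then accumulate with index alignment
    c = chunk.apply(lambda x: x.value_counts())
    counts = c if counts is None else counts.add(c, fill_value=0)
store.close()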
You can then do a 'groupby' like this:
In [69]: store = pd.HDFStore('data.h5')
In [61]: dim1 = Index(store.select_column('df','dim1').unique())
In [66]: store.close()
In [67]: groups = dim1[0:10]
In [68]: groups
Out[68]: Index([u'thing28', u'thing62', u'thing64', u'thing33', u'thing46', u'thing2', u'thing1', u'thing42', u'thing56', u'thing99'], dtype='object')
In [70]: pd.read_hdf('data.h5','df',where='dim1=groups').apply(lambda x: x.value_counts())
Out[70]:
dim1 dim2 dim3
thing1 14 2 1
thing10 NaN 1 1
thing11 NaN 1 2
thing12 NaN 5 NaN
thing13 NaN 1 NaN
thing14 NaN 1 1
thing15 NaN 1 1
thing16 NaN 1 3
thing17 NaN NaN 2
thing18 NaN 1 1
thing19 NaN 1 2
thing2 9 1 1
thing20 NaN 2 NaN
thing21 NaN NaN 1
thing22 NaN 2 2
thing23 NaN 2 3
thing24 NaN 2 1
thing25 NaN 3 2
thing26 NaN 2 2
thing27 NaN 3 1
thing28 11 NaN NaN
thing29 NaN 1 2
thing30 NaN 2 NaN
thing31 NaN 1 1
thing32 NaN 1 1
thing33 13 1 2
thing34 NaN 1 NaN
thing35 NaN 1 NaN
thing36 NaN 1 1
thing37 NaN 1 2
thing38 NaN 3 NaN
thing39 NaN 3 1
thing4 NaN 2 NaN
thing41 NaN NaN 1
thing42 9 1 1
thing43 NaN NaN 1
thing44 NaN 1 2
thing45 NaN NaN 2
thing46 12 NaN 1
thing47 NaN 1 1
thing48 NaN 1 NaN
thing49 NaN 1 NaN
thing5 NaN 2 2
thing50 NaN NaN 3
thing51 NaN 2 2
thing52 NaN 1 3
thing53 NaN 2 4
thing55 NaN NaN 2
thing56 12 1 1
thing57 NaN NaN 3
thing58 NaN 1 2
thing6 NaN NaN 1
thing60 NaN 1 1
thing61 NaN 1 4
thing62 8 2 1
thing63 NaN 1 1
thing64 15 NaN 1
thing66 NaN 1 2
thing67 NaN 2 NaN
thing68 NaN 1 1
... ... ...
[90 rows x 3 columns]
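Finally, to get something close to the four-column table in the question (the three keys plus a count), one option is a groupby over the data columns; a sketch, assuming the full result fits in memory for this step (as the question allows for the statistical transformations):
counts = pd.read_hdf('data.h5', 'df').groupby(['dim1', 'dim2', 'dim3']).size()
counts = counts.reset_index(name='count')
print(counts.head())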