Selecting data from an HDFStore by floating-point data_column - python-2.7

I have a table in an HDFStore with a column of floats f stored as a data_column. I would like to select a subset of rows where, e.g., f==0.6.
I'm running into trouble that I'm assuming is related to a floating-point precision mismatch somewhere. Here is an example:
In [1]: f = np.arange(0, 1, 0.1)
In [2]: s = f.astype('S')
In [3]: df = pd.DataFrame({'f': f, 's': s})
In [4]: df
Out[4]:
     f    s
0  0.0  0.0
1  0.1  0.1
2  0.2  0.2
3  0.3  0.3
4  0.4  0.4
5  0.5  0.5
6  0.6  0.6
7  0.7  0.7
8  0.8  0.8
9  0.9  0.9
[10 rows x 2 columns]
In [5]: with pd.get_store('test.h5', mode='w') as store:
   ...:     store.append('df', df, data_columns=True)
   ...:
In [6]: with pd.get_store('test.h5', mode='r') as store:
   ...:     selection = store.select('df', 'f=f')
   ...:
In [7]: selection
Out[7]:
     f    s
0  0.0  0.0
1  0.1  0.1
2  0.2  0.2
4  0.4  0.4
5  0.5  0.5
8  0.8  0.8
9  0.9  0.9
[7 rows x 2 columns]
I would like the query to return all of the rows but instead several are missing. A query with where='f=0.3' returns an empty table:
In [8]: with pd.get_store('test.h5', mode='r') as store:
   ...:     selection = store.select('df', 'f=0.3')
   ...:
In [9]: selection
Out[9]:
Empty DataFrame
Columns: [f, s]
Index: []
[0 rows x 2 columns]
I'm wondering whether this is the intended behavior, and if so, whether there is a simple workaround, such as setting a precision limit for floating-point queries in pandas. I'm using version 0.13.1:
In [10]: pd.__version__
Out[10]: '0.13.1-55-g7d3e41c'

I don't think so, no. Pandas is built around numpy, and I have never seen any tools for approximate float equality except testing utilities like assert_allclose, and that won't help here.
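At least for the where='f=0.3' case, the root cause is visible in plain numpy: the values produced by np.arange are not exactly the decimal literals they print as, so an exact-equality query has nothing to match:
>>> f = np.arange(0, 1, 0.1)
>>> f[3]
0.30000000000000004
>>> f[3] == 0.3
False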
The best you can do is something like:
In [17]: with pd.get_store('test.h5', mode='r') as store:
   ....:     selection = store.select('df', '(f > 0.2) & (f < 0.4)')
   ....:
In [18]: selection
Out[18]:
     f    s
3  0.3  0.3
If this is a common idiom for you, make a function for it. You can even get fancy by incorporating numpy float precision.
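For instance, here is a minimal sketch of such a helper. The function name, the where-string construction, and the ULP-sized tolerance are my own choices rather than any pandas API; it simply rewrites the equality test as the interval query shown above:

import numpy as np
import pandas as pd

def select_float_eq(store, key, column, value, ulps=4):
    # Query a narrow interval around `value` instead of testing exact
    # equality; the tolerance is a few float64 machine epsilons wide.
    tol = max(abs(value), 1.0) * np.finfo('float64').eps * ulps
    where = '({0} > {1:.17g}) & ({0} < {2:.17g})'.format(
        column, value - tol, value + tol)
    return store.select(key, where)

with pd.get_store('test.h5', mode='r') as store:
    selection = select_float_eq(store, 'df', 'f', 0.3)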

Related

How do I plot data in a text file depending on the value present in one of the columns

I have a text file with a header and a few columns, which represents results of experiments where some parameters were fixed to obtain some metrics. The file is in the following format:
     A    B     C     D     E
0  0.5  0.2  0.25  0.75  1.25
1  0.5  0.3  0.12  0.41  1.40
2  0.5  0.4  0.85  0.15  1.55
3  1.0  0.2  0.11  0.15  1.25
4  1.0  0.3  0.10  0.11  1.40
5  1.0  0.4  0.87  0.14  1.25
6  2.0  0.2  0.23  0.45  1.55
7  2.0  0.3  0.74  0.85  1.25
8  2.0  0.4  0.55  0.55  1.40
So I want to plot x = B, y = C for each fixed value of A and E. Basically, for E=1.25 I want a series of line plots of x = B, y = C, one line per value of A, and then such a plot for each unique value of E.
Could anyone help with this?
You could do a combination of groupby() and seaborn.lineplot():
import matplotlib.pyplot as plt
import seaborn as sns

for e, d in df.groupby('E'):  # one figure per unique value of E
    fig, ax = plt.subplots()
    sns.lineplot(data=d, x='B', y='C', hue='A', ax=ax)  # one line per A
    ax.set_title(e)
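If you still need to build df from the text file in the first place, something along these lines should work (the filename is a placeholder, and the whitespace separator is an assumption based on the sample shown):

import pandas as pd

# The sample has one more data field per row than header names,
# so pandas uses the first column as the row index automatically.
df = pd.read_csv('results.txt', sep=r'\s+')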

Getting 'ValueError: x and y must be 1D arrays of the same length' when they are in fact 1D arrays of same length

I have this dataframe:
     key variable          value
0   0.25     -0.2  606623.455859
1   0.27     -0.2  621462.029200
2   0.30     -0.2  640299.078053
3   0.33     -0.2  653686.910706
4   0.35     -0.2  659278.593742
5   0.37     -0.2  665684.466383
6   0.40     -0.2  671975.695814
7   0.25        0  530091.733402
8   0.27        0  542501.852937
9   0.30        0  557799.179433
10  0.33        0  571140.149887
11  0.35        0  575117.783803
12  0.37        0  582709.048163
13  0.40        0  588168.965913
14  0.25      0.2  466275.721535
15  0.27      0.2  478678.452615
16  0.30      0.2  492749.041489
17  0.33      0.2  500792.917910
18  0.35      0.2  503620.638204
19  0.37      0.2  507884.996510
20  0.40      0.2  512504.976664
21  0.25      0.5  351579.595889
22  0.27      0.5  359555.855803
23  0.30      0.5  368924.362358
24  0.33      0.5  375069.238800
25  0.35      0.5  377847.414729
26  0.37      0.5  381146.573247
27  0.40      0.5  383836.933547
And I am trying to make a contour plot using this dataframe with the following code:
x = df['key'].values
y = df['variable'].values
z = df['value'].values
plt.tricontourf(x, y, z, colors='k')
I keep getting this error:
ValueError: x and y must be 1D arrays of the same length
But whenever I check the len, .size, .shape, and .ndim of x and y, they are 1D arrays of the same length. Does anyone know why I would get this error?
x.shape returns (28L,) and y.shape returns (28L,) as well
Okay, I found a way to make it work. I'm really not sure why it didn't work the original way, because I was feeding tricontourf 1D arrays, but basically I wrapped my data in a list() call just to make doubly sure it was 1D. This made it work. Here's the code:
x = df_2020_pivot['key'].values
y = df_2020_pivot['variable'].values
z = df_2020_pivot['value'].values
plt.tricontourf(list(x), list(y), list(z))
plt.show()
And this is what it produced
I had the same issue crop up. I was passing in two numpy arrays of the same length and got the 'must be 1D arrays of same length' error. Looking at type(array), the arrays I was passing in were numpy.ndarrays. I used array.tolist() to turn them into plain 1D lists, and this removed the error for me. Wrapping in list() as mentioned above also works.
x = df['key'].values.tolist()
y = df['variable'].values.tolist()
z = df['value'].values
plt.tricontourf(x, y, z, colors='k')
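For reference, here is a fully self-contained version of the working pattern; the small dataframe is made-up stand-in data, not the one from the question:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'key':      [0.25, 0.30, 0.35, 0.25, 0.30, 0.35],
    'variable': [-0.2, -0.2, -0.2,  0.2,  0.2,  0.2],
    'value':    [6.0,  6.4,  6.6,  4.7,  4.9,  5.0],
})

# Plain Python lists side-step whatever array subtype tripped the 1D check.
x = df['key'].values.tolist()
y = df['variable'].values.tolist()
z = df['value'].values.tolist()

plt.tricontourf(x, y, z)
plt.colorbar()
plt.show()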

Find Indexes of Non-NaN Values in Pandas DataFrame

I have a very large dataset (roughly 200000x400); however, I have it filtered so that only a few hundred values remain and the rest are NaN. I would like to create a list of the indexes of those remaining values. I can't seem to find a simple enough solution.
     0    1     2
0  NaN  NaN   1.2
1  NaN  NaN   NaN
2  NaN  1.1   NaN
3  NaN  NaN   NaN
4  1.4  NaN  1.01
For instance, I would like a list of [(0,2), (2,1), (4,0), (4,2)].
Convert the dataframe to its equivalent NumPy array representation and check which entries are NaN. Then take the indices of the negation (i.e., of the non-null positions) using numpy.argwhere. Since the output required must be a list of tuples, you can then map tuple over the rows of the resulting array.
>>> list(map(tuple, np.argwhere(~np.isnan(df.values))))
[(0, 2), (2, 1), (4, 0), (4, 2)]
assuming that your column names are of int dtype:
In [73]: df
Out[73]:
     0    1     2
0  NaN  NaN  1.20
1  NaN  NaN   NaN
2  NaN  1.1   NaN
3  NaN  NaN   NaN
4  1.4  NaN  1.01
In [74]: df.columns.dtype
Out[74]: dtype('int64')
In [75]: df.stack().reset_index().drop(0, 1).apply(tuple, axis=1).tolist()
Out[75]: [(0, 2), (2, 1), (4, 0), (4, 2)]
if your column names are of object dtype:
In [81]: df.columns.dtype
Out[81]: dtype('O')
In [83]: df.stack().reset_index().astype(int).drop(0,1).apply(tuple, axis=1).tolist()
Out[83]: [(0, 2), (2, 1), (4, 0), (4, 2)]
Timing for 50K rows DF:
In [89]: df = pd.concat([df] * 10**4, ignore_index=True)
In [90]: df.shape
Out[90]: (50000, 3)
In [91]: %timeit list(map(tuple, np.argwhere(~np.isnan(df.values))))
10 loops, best of 3: 144 ms per loop
In [92]: %timeit df.stack().reset_index().drop(0, 1).apply(tuple, axis=1).tolist()
1 loop, best of 3: 1.67 s per loop
Conclusion: Nickil Maveli's solution is about 12 times faster for this test DF.

Calculation on groups after group by with Pandas

I have a data frame that is grouped by 2 columns, Date and Client, and I sum the amount, like so:
new_df = df.groupby(['Date', 'Client']).sum()
Now I get the following df:
             Sum
Date Client
1/1  A       0.8
     B       0.2
1/2  A       0.1
     B       0.9
I want to be able to catch the fact that there is a high fluctuation in the ratio, which changed from 0.8/0.2 to 0.1/0.9. What would be the most efficient way to do it? Also, I can't access the Date and Client fields when I try to do
new_df[['Date','Client']]
Why is that?
IIUC you can use pct_change or diff:
new_df = df.groupby(['Date','Client'], as_index=False).sum()
print (new_df)
  Date Client  Sum
0  1/1      A  0.8
1  1/1      B  0.2
2  1/2      A  0.1
3  1/2      B  0.9
new_df['pct_change'] = new_df.groupby('Date')['Sum'].pct_change()
new_df['diff'] = new_df.groupby('Date')['Sum'].diff()
print (new_df)
  Date Client  Sum  pct_change  diff
0  1/1      A  0.8         NaN   NaN
1  1/1      B  0.2       -0.75  -0.6
2  1/2      A  0.1         NaN   NaN
3  1/2      B  0.9        8.00   0.8
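As for the last part of the question, the reason new_df[['Date','Client']] fails on the original new_df is that groupby with the default as_index=True moves the grouping keys into a (Multi)Index, so they are no longer columns. Passing as_index=False as above keeps them as columns; alternatively, call reset_index():

new_df = df.groupby(['Date', 'Client']).sum().reset_index()
new_df[['Date', 'Client']]  # works now: Date and Client are ordinary columns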

Django 1.6 Query Math Incorrect

Not sure why, but context['user_activity_percentage'] is showing 0 when it should be showing 25: context['user_activity'] is 1 and the total count is 4, so it should be int(1/4 * 100) = 25. I verified this in the manage.py shell_plus. Why is it showing 0 instead of 25?
context['user_activity'] = CommunityProfile.list_all_users.date_search(
    date1, date2, column="last_activity").count()
context['user_activity_percentage'] = int(context['user_activity'] /
    CommunityProfile.objects.count() * 100)
If you are using Python 2.x, 1/4 is 0, not 0.25:
>>> 1 / 4
0
If you want to get 0.25, convert one of the values to float:
>>> float(1) / 4
0.25
This behavior is different from Python 3.x's (PEP 238: true division). If you want / to work like it does in Python 3.x, do the following:
>>> from __future__ import division
>>> 1 / 4
0.25
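Applied to the view in the question, a minimal fix is to promote the expression to float before int() truncates (only the percentage line changes; total is just a local name introduced for readability):

total = CommunityProfile.objects.count()
context['user_activity_percentage'] = int(
    100.0 * context['user_activity'] / total)

Multiplying by the float literal 100.0 first makes the whole expression a float, so the division keeps its fractional part and int() then truncates 25.0 to 25.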